Real Time Head Pose Estimation from
Consumer Depth Cameras
Gabriele Fanelli1, Thibaut Weise2, Juergen Gall1, and Luc Van Gool1,3
1 ETH Zurich, Switzerland  2 EPFL Lausanne, Switzerland  3 KU Leuven, Belgium
{fanelli,gall,vangool}@hz.ch, thibaut.weise@epfl.ch
Abstract. We present a system for estimating the location and orientation
of a person’s head from depth data acquired by a low quality device. Our
approach is based on discriminative random regression forests: ensembles
of random trees trained by splitting each node so as to simultaneously
reduce the entropy of the class labels distribution and the variance of the
head position and orientation. We evaluate three different approaches
to jointly take classification and regression performance into account
during training. For evaluation, we acquired a new dataset and propose
a method for its automatic annotation.
1 Introduction
Head pose estimation is a key element of human behavior analysis. For this reason, many applications would benefit from automatic and robust head pose estimation systems. While 2D video presents ambiguities hard to resolve in real time, systems relying on 3D data have shown very good results [5,10]. Such approaches, however, use bulky 3D scanners like [22] and are not useful for consumer products or mobile applications like robots. Today, cheap depth cameras exist, even though they provide much lower quality data.
We present an approach for real time 3D head pose estimation robust to the poor signal-to-noise ratio of current consumer depth cameras. The method is inspired by the recent work of [10], which uses random regression forests [9] to estimate the 3D head pose in real time from high quality depth data. It learns a mapping between simple depth features and real-valued parameters such as 3D head position and rotation angles. The system achieves very good performance and is robust to occlusions, but it assumes that the face is the sole object in the field of view. We extend the regression forests such that they discriminate depth patches that belong to a head (classification) and use only those patches to predict the pose (regression), jointly solving the classification and regression problems. In our experiments, we evaluate several schemes that can be used to optimize both the discriminative power and the regression accuracy of such a random forest. In order to deal with the characteristic noise level of the sensor, we cannot rely on synthetic data as in [10], but have to acquire real training data, i.e., faces captured with a similar sensor. We therefore recorded several subjects and their head movements, annotating the data by tracking each sequence using a personalized template.
Our system works on a frame-by-frame basis, needs no initialization, and runs in real time. In our experiments, we show that it can handle large pose changes and variations such as facial hair and partial occlusions.
2 Related Work
The literature contains several works on head pose estimation, which can be conveniently divided depending on whether they use 2D images or depth data.
Among the algorithms based on 2D images, we can further distinguish between appearance-based methods, which analyze the whole face region, and feature-based methods, which rely on the localization of specific facial features, e.g., the eyes. Examples of appearance-based methods are [13] and [17], where the head pose space is discretized and separate detectors are learned for each segment. Statistical generative models, e.g., active appearance models [8] and their variations [7,19,2], are very popular in the face analysis field, but are rarely employed for head pose estimation. Feature-based methods are limited by their need to either have the same facial features visible across different poses, or define pose-dependent features [24,16]. In general, all 2D image-based methods suffer from several problems, in particular changes in illumination and identity, and rather textureless regions of the face.
With the recent increasing availability of depth-sensing technologies, a few notable works have shown the usefulness of depth for solving the problem of head pose estimation, either as the unique cue [5,10] or in combination with 2D image data [6,20]. Breitenstein et al. [5] developed a real time system capable of handling large head pose variations. Using high quality depth data, the method relies on the assumption that the nose is visible. Real time performance is achieved by using the parallel processing power of a GPU. The approach proposed in [10] also relies on high quality depth data, but uses random regression forests [9] to estimate the head pose, reaching real time performance without the aid of parallel computations on the GPU and without assuming any particular facial feature to be visible. While both [10] and [5] consider the case where the head is the only object present in the field of view, we deal with depth images where other parts of the body might be visible and therefore need to discriminate which image patches belong to the head and which don’t.
Random forests [4] and their variants are very popular in computer vision [18,11,9,14,12] for their capability of handling large training sets, fast execution time, and high generalization power. In [18,11], random forests have been combined with the concept of Hough transform for object detection and action recognition. These methods use two objective functions for optimizing the classification and the Hough voting properties of the random forests. While Gall et al. [11] randomly select which measure to optimize at each node of the trees, Okada [18] proposes a joint objective function defined as a weighted sum of the classification and regression measures. In this work, we evaluate several schemes for integrating two different objective functions, including linear weighting [18] and random selection [11].
Fig. 1. Simple example of a Discriminative Regression Forest. a): A patch is sent down two trees, ending up in a non-head leaf in the first case, thus not producing a vote, and in a head leaf in the second case, extracting the multivariate Gaussian distribution stored at the leaf. In b), one training depth image is shown. The blue bounding box enclosing the head specifies where to sample positive (green, inside) and negative (red, outside) patches.
3 Discriminative Random Regression Forests for Head Pose Estimation
Decision trees [3] are powerful tools capable of splitting a hard problem into simpler ones, solvable with trivial predictors, and thus achieving highly non-linear mappings. Each node in a tree performs a test, the result of which directs a data sample towards one of the children nodes. The tests at the nodes are chosen in order to cluster the training data so as to allow good predictions using simple models. Such models are computed and stored at the leaves, based on the clusters of annotated data which reach them during training.
Forests of randomly trained trees generalize much better and are less sensitive to overfitting than decision trees taken separately [4]. Randomness is introduced in the training process, either in the set of training examples provided to each tree, in the set of tests available for optimization at each node, or in both.
When the task at hand involves both classification and regression, we call a Discriminative Random Regression Forest (DRRF) an ensemble of trees that simultaneously classifies test data according to whether they represent part of the object of interest and, only in the positive cases, votes for the desired real-valued variables. A simple DRRF is shown in Figure 1(a): The tests at the nodes lead a sample to a leaf, where it is classified. Only if classified positively, the sample retrieves a Gaussian distribution computed at training time and stored at the leaf, which is used to cast a vote in a multidimensional continuous space.
Our goal is to estimate the 3D position of a head and its orientation from low-quality depth images acquired using a commercial, low-cost sensor. Unlike in [10], the head is not the only part of the person visible in the image; hence the need to classify image patches before letting them vote for the head pose.
3.1 Training
Assuming a set of depth images is available, together with labels indicating head locations and orientations, we randomly select patches of fixed size from the region of the image containing the head as positive samples, and from outside the head region as negatives. Figure 1(b) shows one of the training images we used (acquisition and annotation are explained in Section 4), with the head region marked in blue, and examples of a positive and a negative patch drawn in green and red, respectively.
A tree $T$ in the forest $\mathcal{T} = \{T_t\}$ is constructed from the set of patches $\{P_i = (I_i, c_i, \theta_i)\}$ sampled from the training images. $I_i$ are the depth patches and $c_i \in \{0,1\}$ are the class labels. The vector $\theta_i = \{\theta_x, \theta_y, \theta_z, \theta_{ya}, \theta_{pi}, \theta_{ro}\}$ contains the offset between the 3D point falling on the patch’s center and the head center location, and the Euler rotation angles describing the head orientation.
As in [10], we define the binary test at a non-leaf node as $t_{F_1,F_2,\tau}(I)$:
$$|F_1|^{-1} \sum_{q \in F_1} I(q) - |F_2|^{-1} \sum_{q \in F_2} I(q) > \tau, \tag{1}$$
where $F_1$ and $F_2$ are rectangular, asymmetric regions defined within the patch and $\tau$ is a threshold. Such tests can be efficiently evaluated using integral images.
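The evaluation of Eq. (1) with an integral image can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `(x, y, width, height)` rectangle encoding, and the toy 4x4 patch are assumptions made for the example.

```python
def integral_image(depth):
    """Summed-area table with one extra leading row and column of zeros."""
    h, w = len(depth), len(depth[0])
    ii = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += depth[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def region_mean(ii, rect):
    """Mean depth inside rect = (x, y, width, height) using four lookups."""
    x, y, w, h = rect
    total = ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]
    return total / (w * h)


def binary_test(ii, f1, f2, tau):
    """Eq. (1): mean depth over F1 minus mean depth over F2, compared with tau."""
    return region_mean(ii, f1) - region_mean(ii, f2) > tau


# Toy patch (hypothetical values): left half at depth 1, right half at depth 5.
patch = [
    [1.0, 1.0, 5.0, 5.0],
    [1.0, 1.0, 5.0, 5.0],
    [1.0, 1.0, 5.0, 5.0],
    [1.0, 1.0, 5.0, 5.0],
]
ii = integral_image(patch)
f1 = (2, 0, 2, 4)  # right half of the patch
f2 = (0, 0, 2, 2)  # top-left quarter of the patch
result = binary_test(ii, f1, f2, tau=3.0)
```

Once the table is built, each region mean costs four lookups regardless of the rectangle size, which is what makes a large pool of candidate tests affordable.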
During training, for each non-leaf node starting from the root, we generate a large pool of binary tests $\{t^k\}$ by randomly choosing $F_1$, $F_2$, and $\tau$. The test which maximizes a specific optimization function is picked; the data is then split using the selected test and the process iterates until a leaf is created, when either the maximum tree depth is reached or fewer than a certain number of patches are left. Leaves store two kinds of information: the ratio of positive patches that reached them during training, $p(c=1 \mid P)$, and the multivariate Gaussian distribution computed from the pose parameters of the positive patches.
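The two quantities stored at a leaf can be sketched as below. This is an illustrative sketch only: the `make_leaf` name, the dictionary layout, and the normalisation of the covariance by the number of positives are assumptions, not details given in the text.

```python
def make_leaf(patches):
    """Build leaf statistics from a list of (class_label, theta) pairs.

    theta is the 6-D pose label (3-D offset followed by three rotation angles).
    """
    positives = [theta for label, theta in patches if label == 1]
    ratio = len(positives) / len(patches)  # ratio of positive patches
    if not positives:
        return {"p_head": ratio, "mean": None, "cov": None}
    n, dim = len(positives), len(positives[0])
    mean = [sum(t[k] for t in positives) / n for k in range(dim)]
    # Assumption: covariance normalised by n (maximum-likelihood estimate).
    cov = [[sum((t[a] - mean[a]) * (t[b] - mean[b]) for t in positives) / n
            for b in range(dim)] for a in range(dim)]
    return {"p_head": ratio, "mean": mean, "cov": cov}


# Hypothetical leaf content: two head patches and two non-head patches.
leaf = make_leaf([
    (1, [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]),
    (1, [3.0, 0.0, 0.0, 0.0, 0.0, 0.0]),
    (0, [0.0] * 6),
    (0, [0.0] * 6),
])
```

Only the positive patches contribute to the Gaussian, while all patches contribute to the class ratio.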
For the problem at hand, we need trees able to both classify a patch as belonging to a head or not and cast precise votes into the spaces spanned by 3D head locations and orientations. This is the main difference with [10], where the face is assumed to cover most of the image and thus only a regression measure is used. We thus evaluate the goodness of a split using a classification measure $U_C(P \mid t^k)$ and a regression measure $U_R(P \mid t^k)$: the former tends to separate the patches at each node seeking to maximize the discriminative power of the tree, the latter favors regression accuracy.
Similar to [11], we employ a classification measure which, when maximized, tends to separate the patches so that class uncertainty for a split is minimized:
$$U_C(P \mid t^k) = \frac{|P_L| \sum_c p(c \mid P_L) \ln p(c \mid P_L) + |P_R| \sum_c p(c \mid P_R) \ln p(c \mid P_R)}{|P_L| + |P_R|}, \tag{2}$$
where $p(c \mid P)$ is the ratio of patches belonging to class $c \in \{0,1\}$ in the set $P$.
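A small numerical sketch of Eq. (2) may help to see its behaviour. The function names are hypothetical, and treating $0 \ln 0$ as 0 and an empty child as contributing 0 are conventions assumed for this example.

```python
import math


def class_entropy_term(labels):
    """Sum over c in {0, 1} of p(c|P) * ln p(c|P), with 0 * ln 0 taken as 0."""
    n = len(labels)
    if n == 0:
        return 0.0  # assumption: an empty child contributes nothing
    total = 0.0
    for c in (0, 1):
        p = sum(1 for label in labels if label == c) / n
        if p > 0.0:
            total += p * math.log(p)
    return total


def classification_measure(left_labels, right_labels):
    """Eq. (2): size-weighted negative entropy of the two children."""
    n_l, n_r = len(left_labels), len(right_labels)
    weighted = (n_l * class_entropy_term(left_labels)
                + n_r * class_entropy_term(right_labels))
    return weighted / (n_l + n_r)


pure_split = classification_measure([1, 1, 1], [0, 0, 0])
mixed_split = classification_measure([1, 0], [0, 1])
```

The measure is never positive and reaches its maximum of 0 when both children are pure, so maximizing it favors tests that separate head from non-head patches.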
For regression, we use the information gain defined by [9]:
$$U_R(P \mid t^k) = H(P) - \left(w_L H(P_L) + w_R H(P_R)\right), \tag{3}$$
where $H(P)$ is the differential entropy of the set $P$ and $w_i$, $i \in \{L,R\}$, is the ratio of patches sent to each child node.
Our labels (the vectors $\theta$) are modeled as realizations of a multivariate Gaussian, i.e., $p(\theta \mid L) = \mathcal{N}(\theta; \bar{\theta}, \Sigma)$. Moreover, as in [10], we assume the covariance matrix to be block-diagonal, i.e., we allow covariance only among offset vectors and among head rotation angles, but not between the two. For these reasons, we can rewrite Eq. (3) as:
$$U_R(P \mid t^k) = \log\left(|\Sigma^v| + |\Sigma^a|\right) - \sum_{i \in \{L,R\}} w_i \log\left(|\Sigma^v_i| + |\Sigma^a_i|\right), \tag{4}$$
where $\Sigma^v$ and $\Sigma^a$ are the covariance matrices of the offsets and rotation angles (the two diagonal blocks in $\Sigma$). Maximizing Eq. (4) minimizes the determinants of the covariance matrices, thus decreasing regression uncertainty.
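The regression measure of Eq. (4) can be sketched numerically as below. The helper names, the small `eps` guard against `log(0)`, the covariance normalisation by n, and the synthetic two-cluster data are all assumptions of this example rather than details from the text.

```python
import math
import random


def covariance3(vectors):
    """3x3 sample covariance (normalised by n) of a list of 3-D vectors."""
    n = len(vectors)
    mean = [sum(v[k] for v in vectors) / n for k in range(3)]
    return [[sum((v[a] - mean[a]) * (v[b] - mean[b]) for v in vectors) / n
             for b in range(3)] for a in range(3)]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def log_det_sum(thetas, eps=1e-12):
    """log(|Sigma^v| + |Sigma^a|) for 6-D labels: offsets first, angles last.

    eps is an assumption of this sketch, guarding against log(0).
    """
    det_v = det3(covariance3([t[:3] for t in thetas]))
    det_a = det3(covariance3([t[3:] for t in thetas]))
    return math.log(max(det_v, 0.0) + max(det_a, 0.0) + eps)


def regression_measure(parent, left, right):
    """Eq. (4): parent term minus the weighted terms of the two children."""
    w_l = len(left) / len(parent)
    w_r = len(right) / len(parent)
    return log_det_sum(parent) - (w_l * log_det_sum(left)
                                  + w_r * log_det_sum(right))


# Synthetic labels (hypothetical): two well-separated pose clusters.
rng = random.Random(0)
cluster_a = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(30)]
cluster_b = [[rng.gauss(10.0, 1.0) for _ in range(6)] for _ in range(30)]
parent = cluster_a + cluster_b

gain_good = regression_measure(parent, cluster_a, cluster_b)
gain_bad = regression_measure(parent, parent[0::2], parent[1::2])
```

A split that separates the two clusters yields children with much smaller covariance determinants than the parent, so it scores higher than a split that leaves both clusters mixed in each child.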
The two measures (2) and (4) can be combined in different ways, and we investigate three different approaches. While the method of [11] randomly chooses between classification and regression at each node, the method of [18] uses a weighted sum of the two measures, defined as:
$$\arg\max_k \left( U_C + \alpha \max\left(p(c=1 \mid P) - t_p,\, 0\right) U_R \right). \tag{5}$$
In the above equation, $p(c=1 \mid P)$ represents the ratio of positive samples contained in the set, or purity, $t_p$ is an activation threshold, and $\alpha$ a constant weight. When maximizing (5), the optimization is steered by the classification term alone until the purity of positive patches reaches the threshold $t_p$. From that point on, the regression term plays an increasingly important role.
We propose a third way to combine the two measures by removing the activation threshold from (5) and using an exponential function as weight:
$$\arg\max_k \left( U_C + \left(1.0 - e^{-d/\lambda}\right) U_R \right), \tag{6}$$
where $d$ is the depth of the node. In this way, the regression measure is given increasingly higher weight as we descend towards the leaves, with the parameter $\lambda$ specifying the steepness of the change.
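The weighting schemes of Eqs. (5) and (6) can be compared with a short sketch. The function names, the candidate tuples, and all numeric values are hypothetical, and the exponent in Eq. (6) is read here as $-d/\lambda$.

```python
import math


def linear_weight(purity, alpha, t_p):
    """Weight of U_R in Eq. (5): zero until the purity exceeds t_p."""
    return alpha * max(purity - t_p, 0.0)


def exponential_weight(depth, lam):
    """Weight of U_R in Eq. (6), assuming the exponent is -depth / lam."""
    return 1.0 - math.exp(-depth / lam)


def select_test(candidates, weight):
    """Return the id of the candidate maximizing U_C + weight * U_R.

    candidates is a list of (test_id, u_c, u_r) tuples.
    """
    return max(candidates, key=lambda c: c[1] + weight * c[2])[0]


# Hypothetical scores: test "a" classifies better, test "b" regresses better.
candidates = [("a", -0.1, 0.5), ("b", -0.3, 2.0)]

chosen_at_root = select_test(candidates, exponential_weight(0, lam=2.0))
chosen_deep = select_test(candidates, exponential_weight(10, lam=2.0))
```

With these example numbers, the root ignores the regression term entirely and keeps the better classifier, whereas a node at depth 10 prefers the test with the higher regression score.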
3.2 Head pose estimation
For estimating the head pose from a depth image, we densely extract patches from the image and pass them through the forest. The tests at the nodes guide each patch all the way to a leaf $L$, but not all leaves are to be considered for regression; the Gaussian $p(\theta)$ is taken into account only if $p(c=1 \mid P) = 1$ and $\mathrm{trace}(\Sigma) < \mathit{max}_v$, with $\mathit{max}_v$ an empirical value for the maximum allowed variance. As in [10], a stride in the sampling of the patches can be introduced in order to find the desired compromise between speed and accuracy of the estimate. To be able to handle multiple heads and remove outliers, we perform a