Deep Learning Face Representation from Predicting 10,000 Classes

Yi Sun1    Xiaogang Wang2    Xiaoou Tang1,3
1 Department of Information Engineering, The Chinese University of Hong Kong
2 Department of Electronic Engineering, The Chinese University of Hong Kong
3 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
sy011@ie.cuhk.edu.hk    xgwang@ee.cuhk.edu.hk    xtang@ie.cuhk.edu.hk
Abstract
This paper proposes to learn a set of high-level feature representations through deep learning, referred to as Deep hidden IDentity features (DeepID), for face verification. We argue that DeepID can be effectively learned through challenging multi-class face identification tasks, whilst they can be generalized to other tasks (such as verification) and new identities unseen in the training set. Moreover, the generalization capability of DeepID increases as more face classes are to be predicted at training. DeepID features are taken from the last hidden layer neuron activations of deep convolutional networks (ConvNets). When learned as classifiers to recognize about 10,000 face identities in the training set and configured to keep reducing the neuron numbers along the feature extraction hierarchy, the deep ConvNets gradually form compact identity-related features in the top layers with only a small number of hidden neurons. The proposed features are extracted from various face regions to form complementary and over-complete representations. Any state-of-the-art classifier can be learned based on these high-level representations for face verification. 97.45% verification accuracy on LFW is achieved with only weakly aligned faces.

1. Introduction
Face verification in unconstrained conditions has been studied extensively in recent years [21, 15, 7, 34, 17, 26, 18, 8, 2, 9, 3, 29, 6] due to its practical applications and the publishing of LFW [19], an extensively reported dataset for face verification algorithms. The current best-performing face verification algorithms typically represent faces with over-complete low-level features, followed by shallow models [9, 29, 6]. Recently, deep models such as ConvNets [24] have been proved effective for extracting high-level visual features [11, 20, 14] and are used for face verification [18, 5, 31, 32, 36]. Huang et al. [18] learned a generative deep model without supervision.
Cai et al. [5] learned deep nonlinear metrics. In [31], the deep models are supervised by the binary face verification target. Differently, in this paper we propose to learn high-level face identity features with deep models through face identification, i.e., classifying a training image into one of n identities (n ≈ 10,000 in this work). This high-dimensional prediction task is much more challenging than face verification; however, it leads to good generalization of the learned feature representations. Although learned through identification, the features are shown to be effective for face verification and new faces unseen in the training set.

Figure 1. An illustration of the feature extraction process. Arrows indicate forward propagation directions. The number of neurons in each layer of the multiple deep ConvNets is labeled beside each layer. The DeepID features are taken from the last hidden layer of each ConvNet and predict a large number of identity classes. Feature numbers continue to reduce along the feature extraction cascade till the DeepID layer.
We propose an effective way to learn high-level over-complete features with deep ConvNets. A high-level illustration of our feature extraction process is shown in Figure 1. The ConvNets are learned to classify all the faces available for training by their identities, with the last hidden layer neuron activations as features (referred to as
Deep hidden IDentity features, or DeepID). Each ConvNet takes a face patch as input and extracts local low-level features in the bottom layers. Feature numbers continue to reduce along the feature extraction cascade, while gradually more global and high-level features are formed in the top layers. A highly compact 160-dimensional DeepID is acquired at the end of the cascade that contains rich identity information and directly predicts a much larger number (e.g., 10,000) of identity classes. Classifying all the identities simultaneously instead of training binary classifiers as in [21, 2, 3] is based on two considerations. First, it is much more difficult to predict a training sample into one of many classes than to perform binary classification. This challenging task can make full use of the super learning capacity of neural networks to extract effective features for face recognition. Second, it implicitly adds a strong regularization to ConvNets, which helps to form shared hidden representations that can classify all the identities well. Therefore, the learned high-level features have good generalization ability and do not over-fit to a small subset of training faces. We constrain the DeepID features to be significantly fewer than the classes of identities they predict, which is key to learning highly compact and discriminative features. We further concatenate the DeepID extracted from various face regions to form complementary and over-complete representations. The learned features can be well generalized to new identities in test, which are not seen in training, and can be readily integrated with any state-of-the-art face classifier (e.g., Joint Bayesian [8]) for face verification.
Our method achieves 97.45% face verification accuracy on LFW using only weakly aligned faces, which is almost as good as the human performance of 97.53%. We also observe that as the number of training identities increases, the verification performance steadily improves. Although the prediction task at the training stage becomes more challenging, the discrimination and generalization ability of the learned features increases. It leaves the door wide open for future improvement of accuracy with more training data.
2. Related work

Many face verification methods represent faces by high-dimensional over-complete face descriptors, followed by shallow models. Cao et al. [7] encoded each face image into 26K learning-based (LE) descriptors, and then calculated the L2 distance between the LE descriptors after PCA. Chen et al. [9] extracted 100K LBP descriptors at dense facial landmarks with multiple scales and used Joint Bayesian [8] for verification after PCA. Simonyan et al. [29] computed 1.7M SIFT descriptors densely in scale and space, encoded the dense SIFT features into Fisher vectors, and learned linear projection for discriminative dimensionality reduction. Huang et al. [17] combined 1.2M CMD [33] and SLBP [1] descriptors, and learned sparse Mahalanobis metrics for face verification.
Some previous studies have further learned identity-related features based on low-level features. Kumar et al. [21] trained attribute and simile classifiers to detect facial attributes and measure face similarities to a set of reference people. Berg and Belhumeur [2, 3] trained classifiers to distinguish the faces of two different people. Features are the outputs of the learned classifiers. They used SVM classifiers, which are shallow structures, and their learned features are still relatively low-level. In contrast, we classify all the identities from the training set simultaneously. Moreover, we use the last hidden layer activations as features instead of the classifier outputs. In our ConvNets, the neuron number of the last hidden layer is much smaller than that of the output, which forces the last hidden layer to learn shared hidden representations for faces of different people in order to classify all of them well, resulting in highly discriminative and compact features with good generalization ability.
A few deep models have been used for face verification or identification. Chopra et al. [10] used a Siamese network [4] for deep metric learning. The Siamese network extracts features separately from two compared inputs with two identical sub-networks, taking the distance between the outputs of the two sub-networks as dissimilarity. [10] used deep ConvNets as the sub-networks. In contrast to the Siamese network, in which feature extraction and recognition are jointly learned with the face verification target, we conduct feature extraction and recognition in two steps, with the first feature extraction step learned with the target of face identification, which is a much stronger supervision signal than verification. Huang et al. [18] generatively learned features with CDBNs [25], then used ITML [13] and linear SVM for face verification. Cai et al. [5] also learned deep metrics under the Siamese network framework as [10], but used a two-level ISA network [23] as the sub-networks instead. Zhu et al. [35, 36] learned deep neural networks to transform faces in arbitrary poses and illumination to frontal faces with normal illumination, and then used the last hidden layer features or the transformed faces for face recognition. Sun et al. [31] used multiple deep ConvNets to learn high-level face similarity features and trained a classification RBM [22] for face verification. Their features are jointly extracted from a pair of faces instead of from a single face.
3. Learning DeepID for face verification

3.1. Deep ConvNets
Our deep ConvNets contain four convolutional layers (with max-pooling) to extract features hierarchically, followed by the fully-connected DeepID layer and the softmax output layer indicating identity classes. The input is 39×31×k for rectangle patches and 31×31×k for square patches, where k = 3 for color patches and k = 1 for gray patches. Figure 2 shows the detailed structure of the ConvNet which takes 39×31×1 input and predicts n (e.g., n = 10,000) identity classes. When the input sizes change, the height and width of the maps in the following layers will change accordingly. The dimension of the DeepID layer is fixed to 160, while the dimension of the output layer varies according to the number of classes it predicts. Feature numbers continue to reduce along the feature extraction hierarchy until the last hidden layer (the DeepID layer), where highly compact and predictive features are formed, which predict a much larger number of identity classes with only a few features.

Figure 2. ConvNet structure. The length, width, and height of each cuboid denote the map number and the dimension of each map for all input, convolutional, and max-pooling layers. The inside small cuboids and squares denote the 3D convolution kernel sizes and the 2D pooling region sizes of convolutional and max-pooling layers, respectively. Neuron numbers of the last two fully-connected layers are marked beside each layer.
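As a concrete reference, a minimal PyTorch sketch of a ConvNet with this shape for a 39×31×1 input is given below. It is not the authors' implementation: the map numbers and kernel sizes are illustrative assumptions (the actual values are labeled in Figure 2), standard convolutions stand in for the locally shared weights of the higher layers, and the DeepID layer pools features from both the third (after max-pooling) and fourth convolutional layers as described later in this section.

```python
import torch
import torch.nn as nn

class DeepIDConvNet(nn.Module):
    """Sketch of the four-conv-layer ConvNet with a 160-d DeepID layer
    and an n-way softmax output (map counts/kernel sizes are assumed)."""
    def __init__(self, n_classes=10000):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(1, 20, 4), nn.ReLU(), nn.MaxPool2d(2))   # 39x31 -> 18x14
        self.conv2 = nn.Sequential(nn.Conv2d(20, 40, 3), nn.ReLU(), nn.MaxPool2d(2))  # -> 8x6
        self.conv3 = nn.Sequential(nn.Conv2d(40, 60, 3), nn.ReLU(), nn.MaxPool2d(2))  # -> 3x2
        self.conv4 = nn.Sequential(nn.Conv2d(60, 80, 2), nn.ReLU())                   # -> 2x1
        # DeepID layer sees both the pooled third-layer and the fourth-layer maps.
        self.deepid = nn.Linear(60 * 3 * 2 + 80 * 2 * 1, 160)
        self.out = nn.Linear(160, n_classes)   # n-way softmax (via CrossEntropyLoss)

    def forward(self, x):                      # x: (batch, 1, 39, 31)
        h3 = self.conv3(self.conv2(self.conv1(x)))
        h4 = self.conv4(h3)
        multi_scale = torch.cat([h3.flatten(1), h4.flatten(1)], dim=1)
        deepid = torch.relu(self.deepid(multi_scale))   # 160-d DeepID features
        return self.out(deepid), deepid
```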
The convolution operation is expressed as
\[
y_j(r) = \max\Big(0,\; b_j(r) + \sum_i k_{ij}(r) * x_i(r)\Big), \qquad (1)
\]
where $x_i$ and $y_j$ are the $i$-th input map and the $j$-th output map, respectively, $k_{ij}$ is the convolution kernel between the $i$-th input map and the $j$-th output map, $*$ denotes convolution, and $b_j$ is the bias of the $j$-th output map. We use ReLU nonlinearity ($y = \max(0, x)$) for hidden neurons, which is shown to have better fitting abilities than the sigmoid function [20]. Weights in higher convolutional layers of our ConvNets are locally shared to learn different mid- or high-level features in different regions [18]. $r$ in Equation 1 indicates a local region where weights are shared. In the third convolutional layer, weights are locally shared in every 2×2 region, while weights in the fourth convolutional layer are totally unshared. Max-pooling is formulated as
\[
y^i_{j,k} = \max_{0 \le m, n < s} \big\{ x^i_{j \cdot s + m,\, k \cdot s + n} \big\}, \qquad (2)
\]
where each neuron in the $i$-th output map $y^i$ pools over an $s \times s$ non-overlapping local region in the $i$-th input map $x^i$.
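A didactic numpy sketch of Equations (1) and (2) follows; the function names are our own, and the region indexing is a simplified interpretation of the locally shared weights (one kernel set per sharing region $r$).

```python
import numpy as np

def conv_relu(x, kernels, biases, regions=1):
    """Eq. (1): ReLU convolution with optionally locally shared weights.
    x: (C_in, H, W); kernels: (R, C_out, C_in, kh, kw) with one kernel set
    per sharing region (R = regions**2, e.g. regions=2 for the 2x2 sharing
    of the third layer, regions=1 for globally shared weights); biases: (R, C_out)."""
    R, c_out, c_in, kh, kw = kernels.shape
    H, W = x.shape[1] - kh + 1, x.shape[2] - kw + 1        # valid convolution
    y = np.zeros((c_out, H, W))
    for p in range(H):
        for q in range(W):
            r = (p * regions // H) * regions + (q * regions // W)   # sharing region index
            patch = x[:, p:p + kh, q:q + kw]
            y[:, p, q] = biases[r] + np.tensordot(kernels[r], patch, axes=3)
    return np.maximum(y, 0.0)                              # ReLU nonlinearity

def max_pool(x, s):
    """Eq. (2): non-overlapping s x s max-pooling applied per map."""
    c, h, w = x.shape
    return x[:, :h - h % s, :w - w % s].reshape(c, h // s, s, w // s, s).max(axis=(2, 4))
```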
Figure 3. Top: ten face regions of medium scales. The five regions in the top left are global regions taken from the weakly aligned faces; the other five in the top right are local regions centered around the five facial landmarks (two eye centers, nose tip, and two mouth corners). Bottom: three scales of two particular patches.
The last hidden layer of the DeepID is fully connected to both the third and fourth convolutional layers (after max-pooling) such that it sees multi-scale features [28] (features in the fourth convolutional layer are more global than those in the third one). This is critical to feature learning because, after successive down-sampling along the cascade, the fourth convolutional layer contains too few neurons and becomes the bottleneck for information propagation. Adding the bypassing connections between the third convolutional layer (referred to as the skipping layer) and the last hidden layer reduces the possible information loss in the fourth convolutional layer. The last hidden layer takes the function
\[
y_j = \max\Big(0,\; \sum_i x^1_i \cdot w^1_{i,j} + \sum_i x^2_i \cdot w^2_{i,j} + b_j\Big), \qquad (3)
\]
where $x^1, w^1$ and $x^2, w^2$ denote the neurons and weights in the third and fourth convolutional layers, respectively. It linearly combines the features in the previous two convolutional layers, followed by ReLU non-linearity.
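Equation (3) amounts to a single linear map from the flattened third- and fourth-layer features followed by ReLU; a one-function numpy sketch (with assumed array shapes):

```python
import numpy as np

def deepid_layer(x3, x4, w3, w4, b):
    """Eq. (3): the 160-d DeepID layer linearly combines the flattened
    skipping-layer (third) and fourth-layer features, then applies ReLU.
    x3: (d3,), x4: (d4,), w3: (d3, 160), w4: (d4, 160), b: (160,)."""
    return np.maximum(x3 @ w3 + x4 @ w4 + b, 0.0)
```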
The ConvNet output is an n-way softmax predicting the probability distribution over n different identities,
\[
y_i = \frac{\exp(y'_i)}{\sum_{j=1}^{n} \exp(y'_j)}, \qquad (4)
\]
where $y'_j = \sum_{i=1}^{160} x_i \cdot w_{i,j} + b_j$ linearly combines the 160 DeepID features $x_i$ as the input of neuron $j$, and $y_j$ is its output. The ConvNet is learned by minimizing $-\log y_t$, where $t$ is the target class. Stochastic gradient descent is used, with gradients calculated by back-propagation.
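A small numpy sketch of the softmax output in Equation (4) and the $-\log y_t$ identification loss; the function names and the max-subtraction for numerical stability are our additions.

```python
import numpy as np

def softmax_output(deepid, w, b):
    """Eq. (4): n-way softmax over identities. deepid: (160,), w: (160, n), b: (n,)."""
    logits = deepid @ w + b          # y'_j = sum_i x_i * w_{i,j} + b_j
    logits -= logits.max()           # subtract max for numerical stability
    e = np.exp(logits)
    return e / e.sum()

def identification_loss(deepid, w, b, t):
    """Training objective -log y_t for target identity t (minimized by SGD)."""
    return -np.log(softmax_output(deepid, w, b)[t])
```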
3.2. Feature extraction

We detect five facial landmarks, including the two eye centers, the nose tip, and the two mouth corners, with the facial point detection method proposed by Sun et al. [30]. Faces are globally aligned by similarity transformation according to the two eye centers and the mid-point of the two mouth corners. Features are extracted from 60 face patches with ten regions, three scales, and RGB or gray channels. Figure 3 shows the ten face regions and the three scales of two particular face regions. We trained 60 ConvNets, each of which extracts two 160-dimensional DeepID vectors from a particular patch and its horizontally flipped counterpart. A special case is the patches around the two eye centers and the two mouth corners, which are not flipped themselves, but use the patches symmetric with them (for example, the flipped counterpart of the patch centered on the left eye is derived by flipping the patch centered on the right eye). The total length of the DeepID is 19,200 (160 × 2 × 60), which is ready for the final face verification.
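A sketch of this extraction pipeline is given below; `patch_croppers` and `convnets` are hypothetical placeholders for the 60 cropping functions and the 60 trained ConvNets, and the special symmetric handling of the eye- and mouth-corner patches is omitted for brevity.

```python
import numpy as np

def extract_deepid(face, patch_croppers, convnets):
    """Concatenate the DeepID from 60 patches and their horizontal flips.
    Each ConvNet returns a 160-d vector, so the result is 160 * 2 * 60 = 19,200-d.
    (Eye-/mouth-corner patches, which use their symmetric patch as the flipped
    counterpart, are not treated specially in this simplified sketch.)"""
    feats = []
    for crop, net in zip(patch_croppers, convnets):
        patch = crop(face)
        feats.append(net(patch))                      # 160-d DeepID of the patch
        feats.append(net(np.flip(patch, axis=-1)))    # 160-d DeepID of the flipped patch
    return np.concatenate(feats)                      # 19,200-d DeepID
```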
3.3. Face verification

We use the Joint Bayesian [8] technique for face verification based on the DeepID. Joint Bayesian has been highly successful for face verification [9, 6]. It represents the extracted facial features x (after subtracting the mean) as the sum of two independent Gaussian variables,
\[
x = \mu + \epsilon, \qquad (5)
\]
where $\mu \sim N(0, S_\mu)$ represents the face identity and $\epsilon \sim N(0, S_\epsilon)$ the intra-personal variations. Joint Bayesian models the joint probability of two faces given the intra- or extra-personal variation hypothesis, $P(x_1, x_2 \mid H_I)$ and $P(x_1, x_2 \mid H_E)$. It is readily shown from Equation 5 that these two probabilities are also Gaussian, with covariances
\[
\Sigma_I = \begin{bmatrix} S_\mu + S_\epsilon & S_\mu \\ S_\mu & S_\mu + S_\epsilon \end{bmatrix} \qquad (6)
\]
and
\[
\Sigma_E = \begin{bmatrix} S_\mu + S_\epsilon & 0 \\ 0 & S_\mu + S_\epsilon \end{bmatrix}, \qquad (7)
\]
respectively. $S_\mu$ and $S_\epsilon$ can be learned from data with the EM algorithm. In test, it calculates the likelihood ratio
\[
r(x_1, x_2) = \log \frac{P(x_1, x_2 \mid H_I)}{P(x_1, x_2 \mid H_E)}, \qquad (8)
\]
which has closed-form solutions and is efficient.
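The likelihood ratio in Equation (8) can be evaluated directly from the two block covariances in Equations (6) and (7); the sketch below does exactly that with scipy (the paper relies on an equivalent closed form), assuming mean-subtracted features and $S_\mu$, $S_\epsilon$ already estimated by EM.

```python
import numpy as np
from scipy.stats import multivariate_normal

def joint_bayesian_ratio(x1, x2, S_mu, S_eps):
    """Eq. (8): log P(x1, x2 | H_I) - log P(x1, x2 | H_E), using the
    block covariances of Eq. (6) and (7). Didactic sketch, not optimized."""
    d = len(x1)
    x = np.concatenate([x1, x2])
    sigma_I = np.block([[S_mu + S_eps, S_mu],
                        [S_mu,         S_mu + S_eps]])
    sigma_E = np.block([[S_mu + S_eps, np.zeros((d, d))],
                        [np.zeros((d, d)), S_mu + S_eps]])
    zero = np.zeros(2 * d)
    return (multivariate_normal.logpdf(x, mean=zero, cov=sigma_I)
            - multivariate_normal.logpdf(x, mean=zero, cov=sigma_E))

# toy usage with identity covariances standing in for the EM estimates
d = 150
print(joint_bayesian_ratio(np.ones(d), np.ones(d), np.eye(d), np.eye(d)))
```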
We also train a neural network for verification and compare it to Joint Bayesian, to see whether other models can also learn from the extracted features and how much the features and a good face verification model contribute to the performance, respectively. The neural network contains one input layer taking the DeepID, one locally-connected layer, one fully-connected layer, and a single output neuron indicating face similarity. The input features are divided into 60 groups, each of which contains 640 features extracted from a particular patch pair with a particular ConvNet. Features in the same group are highly correlated. Neurons in the locally-connected layer only connect to a single group of features to learn their local relations and reduce the feature dimension at the same time. The second hidden layer is fully connected to the first hidden layer to learn global relations. The single output neuron is fully connected to the second hidden layer. The hidden neurons are ReLUs and the output neuron is a sigmoid. An illustration of the neural network structure is shown in Figure 4. It has 38,400 input neurons, taking the 19,200-dimensional DeepID features from each face of the compared pair, and 4,800 neurons in each of the following two hidden layers, with every 80 neurons in the first hidden layer locally connected to one of the 60 groups of input neurons.

Figure 4. The structure of the neural network used for face verification. The layer type and dimension are labeled beside each layer. The solid neurons form a subnetwork.
Dropout learning [16] is used for all the hidden neurons. The input neurons cannot be dropped because the learned features are compact and distributed representations (representing a large number of identities with very few neurons) and have to collaborate with each other to represent the identities well. On the other hand, learning high-dimensional features without dropout is difficult due to gradient diffusion. To solve this problem, we first train 60 subnetworks, each with the features of a single group as input. A particular subnetwork is illustrated in Figure 4. We then use the first-layer weights of the subnetworks to initialize those of the original network, and tune the second and third layers of the original network with the first-layer weights fixed.
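A PyTorch sketch of the verification network described above, using the group sizes stated in the text (60 groups of 640 features, 80 locally connected units per group, a 4,800-unit fully connected layer, and a sigmoid output); the dropout rate is an assumption, and the subnetwork pretraining with fixed first-layer weights is not shown.

```python
import torch
import torch.nn as nn

class VerificationNet(nn.Module):
    """Locally-connected layer implemented as one small linear map per
    feature group, followed by a fully-connected layer and a sigmoid output."""
    def __init__(self, n_groups=60, group_dim=640, group_out=80, p_drop=0.5):
        super().__init__()
        self.local = nn.ModuleList(
            [nn.Linear(group_dim, group_out) for _ in range(n_groups)])
        hidden = n_groups * group_out                  # 60 * 80 = 4,800
        self.fc = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, 1)
        self.drop = nn.Dropout(p_drop)                 # dropout on hidden neurons only

    def forward(self, x):                              # x: (batch, 60, 640) grouped pair features
        h1 = torch.cat([torch.relu(layer(x[:, g]))
                        for g, layer in enumerate(self.local)], dim=1)
        h2 = torch.relu(self.fc(self.drop(h1)))
        return torch.sigmoid(self.out(self.drop(h2)))  # predicted face similarity
```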
4. Experiments

We evaluate our algorithm on LFW, which reveals the state of the art of face verification in the wild. Though LFW contains 5749 people, only 85 have more than 15 images, and 4069 people have only one image. It is inadequate to train identity classifiers with so few images per person. Instead, we trained our model on CelebFaces [31] and tested on LFW (Sections 4.1-4.3). CelebFaces contains 87,628 face images of 5436 celebrities from the Internet, with approximately 16 images per person on average. People in LFW and CelebFaces are mutually exclusive.

We randomly choose 80% (4349) of the people from CelebFaces to learn the DeepID, and use the remaining 20% of the people to learn the face verification model (Joint Bayesian or neural networks). For feature learning, the ConvNets are supervised to classify the 4349 people simultaneously from a particular kind of face patch and its flipped counterpart. We randomly select 10% of the images of each training person to generate the validation data. After each training epoch, we observe the top-1 validation set error rate and select the model that provides the lowest one.
In face verification, the feature dimension is reduced to 150 by PCA before learning the Joint Bayesian model. Performance remains almost unchanged over a wide range of dimensions. In test, each face pair is classified by comparing the Joint Bayesian likelihood ratio to a threshold optimized on the training data.
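A minimal sketch of this test-time pipeline, reusing the joint_bayesian_ratio sketch from Section 3.3 and substituting synthetic placeholders for the DeepID features, the EM-estimated covariances, and the tuned threshold.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
train_feats = rng.normal(size=(500, 19200))    # placeholder for training DeepID vectors

pca = PCA(n_components=150).fit(train_feats)   # reduce the 19,200-d DeepID to 150-d
S_mu, S_eps = np.eye(150), np.eye(150)         # placeholders for the EM estimates
threshold = 0.0                                # tuned on training pairs in practice

def same_identity(x1, x2):
    """Accept a pair when the Joint Bayesian log-likelihood ratio
    (joint_bayesian_ratio, defined in the earlier sketch) exceeds the threshold."""
    z1, z2 = pca.transform(np.stack([x1, x2]))
    return joint_bayesian_ratio(z1, z2, S_mu, S_eps) > threshold
```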
To evaluate the performance of our approach at an even larger training scale in Section 4.4, we extend CelebFaces to the CelebFaces+ dataset, which contains 202,599 face images of 10,177 celebrities. Again, people in LFW and CelebFaces+ are mutually exclusive. The ConvNet structure and feature extraction process described in the previous section remain unchanged.
4.1. Multi-scale ConvNets
We verify the effectiveness of directly connecting the neurons in the third convolutional layer (after max-pooling) to the last hidden layer (the DeepID layer), such that it sees both the third and fourth convolutional layer features, forming the so-called multi-scale ConvNets. It also reduces the feature numbers from the convolutional layers to the DeepID layer (shown in Figure 1), which helps the latter to learn higher-level features in order to represent the face identities well with fewer neurons. Figure 5 compares the top-1 validation set error rates of the 60 ConvNets learned to classify the 4349 classes of identities, either with or without the skipping layer. Lower error rates indicate better learned hidden features. Allowing the DeepID to pool over multi-scale features reduces validation errors by an average of 4.72%. It also improves the final face verification accuracy from 95.35% to 96.05% when concatenating the DeepID from the 60 ConvNets and using Joint Bayesian for face verification.

Figure 5. Top-1 validation set error rates of the 60 ConvNets trained on the 60 different patches. The blue and red markers show the error rates of the conventional ConvNets (without the skipping layer) and the multi-scale ConvNets, respectively.
4.2. Learning effective features

Classifying a large number of identities simultaneously is key to learning discriminative and compact hidden features. To verify this, we increase the identity classes for training exponentially (and the output neuron numbers correspondingly) from 136 to 4349, while fixing the neuron numbers in all previous layers (the DeepID is kept 160-dimensional). We observe the classification ability of the ConvNets (measured by the top-1 validation set error rates) and the effectiveness of the learned hidden representations for face verification (measured by the test set verification accuracy) as the identity classes increase. The input is a single patch covering the whole face in this experiment. As shown in Figure 6, both Joint Bayesian and the neural network improve linearly in verification accuracy when the identity classes double. The improvement is significant. When the identity classes increase 32 times from 136 to 4349, the accuracy increases by 10.13% and 8.42% for Joint Bayesian and the neural network, respectively, or 2.03% and 1.68% on average whenever the identity classes double. At the same time, the validation set error rates drop, even when the predicted classes are tens of times more numerous than the last hidden layer neurons, as shown in Figure 7. This phenomenon indicates that ConvNets can learn from classifying each identity and form shared hidden representations that can classify all the identities well. More identity classes help to learn better hidden representations that can distinguish more people (discriminative) without increasing the feature length (compact). The linear increase of test accuracy with respect to the exponentially increasing training data indicates that our features would be further improved if even more identities were available. Examples of the 160-dimensional DeepID learned from the 4349 training identities and extracted from LFW test pairs are shown in Figure 8. We find that faces of the same identity tend to have more commonly activated neurons (positive features in the same positions) than those of different identities. So the learned features extract identity information.

We also test the 4349-dimensional classifier outputs as features for face verification. Joint Bayesian only achieves approximately 66% accuracy on these features, while the neural network fails, where it accounts all the face pairs as