Deep Learning Face Representation from Predicting 10,000 Classes

Yi Sun1    Xiaogang Wang2    Xiaoou Tang1,3
1 Department of Information Engineering, The Chinese University of Hong Kong
2 Department of Electronic Engineering, The Chinese University of Hong Kong
3 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
sy011@ie.cuhk.edu.hk    xgwang@ee.cuhk.edu.hk    xtang@ie.cuhk.edu.hk
Abstract
This paper proposes to learn a set of high-level feature representations through deep learning, referred to as Deep hidden IDentity features (DeepID), for face verification. We argue that DeepID can be effectively learned through challenging multi-class face identification tasks, whilst they can be generalized to other tasks (such as verification) and new identities unseen in the training set. Moreover, the generalization capability of DeepID increases as more face classes are to be predicted at training. DeepID features are taken from the last hidden layer neuron activations of deep convolutional networks (ConvNets). When learned as classifiers to recognize about 10,000 face identities in the training set and configured to keep reducing the neuron numbers along the feature extraction hierarchy, the deep ConvNets gradually form compact identity-related features in the top layers with only a small number of hidden neurons. The proposed features are extracted from various face regions to form complementary and over-complete representations. Any state-of-the-art classifier can be learned based on these high-level representations for face verification. 97.45% verification accuracy on LFW is achieved with only weakly aligned faces.

1. Introduction
Face verification in unconstrained conditions has been studied extensively in recent years [21, 15, 7, 34, 17, 26, 18, 8, 2, 9, 3, 29, 6] due to its practical applications and the publishing of LFW [19], an extensively reported dataset for face verification algorithms. The current best-performing face verification algorithms typically represent faces with over-complete low-level features, followed by shallow models [9, 29, 6]. Recently, deep models such as ConvNets [24] have been proved effective for extracting high-level visual features [11, 20, 14] and are used for face verification [18, 5, 31, 32, 36]. Huang et al. [18] learned a generative deep model without supervision.
Cai et al. [5] learned deep nonlinear metrics. In [31], the deep models are supervised by the binary face verification target. Differently, in this paper we propose to learn high-level face identity features with deep models through face identification, i.e., classifying a training image into one of n identities (n ≈ 10,000 in this work). This high-dimensional prediction task is much more challenging than face verification; however, it leads to good generalization of the learned feature representations. Although learned through identification, the features are shown to be effective for face verification and new faces unseen in the training set.

Figure 1. An illustration of the feature extraction process. Arrows indicate forward propagation directions. The number of neurons in each layer of the multiple deep ConvNets is labeled beside each layer. The DeepID features are taken from the last hidden layer of each ConvNet and predict a large number of identity classes. Feature numbers continue to reduce along the feature extraction cascade till the DeepID layer.
We propose an effective way to learn high-level over-complete features with deep ConvNets. A high-level illustration of our feature extraction process is shown in Figure 1. The ConvNets are learned to classify all the faces available for training by their identities, with the last hidden layer neuron activations as features (referred to as
Deep hidden IDentity features, or DeepID). Each ConvNet takes a face patch as input and extracts local low-level features in the bottom layers. Feature numbers continue to reduce along the feature extraction cascade, while gradually more global and high-level features are formed in the top layers. A highly compact 160-dimensional DeepID is acquired at the end of the cascade that contains rich identity information and directly predicts a much larger number (e.g., 10,000) of identity classes. Classifying all the identities simultaneously instead of training binary classifiers as in [21, 2, 3] is based on two considerations. First, it is much more difficult to predict a training sample into one of many classes than to perform binary classification. This challenging task can make full use of the super learning capacity of neural networks to extract effective features for face recognition. Second, it implicitly adds a strong regularization to ConvNets, which helps to form shared hidden representations that can classify all the identities well. Therefore, the learned high-level features have good generalization ability and do not over-fit to a small subset of training faces. We constrain the DeepID features to be significantly fewer than the classes of identities they predict, which is key to learning highly compact and discriminative features. We further concatenate the DeepID extracted from various face regions to form complementary and over-complete representations. The learned features can be well generalized to new identities in test, which are not seen in training, and can be readily integrated with any state-of-the-art face classifier (e.g., Joint Bayesian [8]) for face verification.
Our method achieves 97.45% face verification accuracy on LFW using only weakly aligned faces, which is almost as good as the human performance of 97.53%. We also observe that as the number of training identities increases, the verification performance steadily improves. Although the prediction task at the training stage becomes more challenging, the discrimination and generalization ability of the learned features increases. It leaves the door wide open for future improvement of accuracy with more training data.
2. Related work

Many face verification methods represent faces by high-dimensional over-complete face descriptors, followed by shallow models. Cao et al. [7] encoded each face image into 26K learning-based (LE) descriptors, and then calculated the L2 distance between the LE descriptors after PCA. Chen et al. [9] extracted 100K LBP descriptors at dense facial landmarks with multiple scales and used Joint Bayesian [8] for verification after PCA. Simonyan et al. [29] computed 1.7M SIFT descriptors densely in scale and space, encoded the dense SIFT features into Fisher vectors, and learned linear projection for discriminative dimensionality reduction. Huang et al. [17] combined 1.2M CMD [33] and SLBP [1] descriptors, and learned sparse Mahalanobis metrics for face verification.
Some previous studies have further learned identity-related features based on low-level features. Kumar et al. [21] trained attribute and simile classifiers to detect facial attributes and measure face similarities to a set of reference people. Berg and Belhumeur [2, 3] trained classifiers to distinguish the faces of two different people. Features are the outputs of the learned classifiers. They used SVM classifiers, which are shallow structures, and their learned features are still relatively low-level. In contrast, we classify all the identities from the training set simultaneously. Moreover, we use the last hidden layer activations as features instead of the classifier outputs. In our ConvNets, the neuron number of the last hidden layer is much smaller than that of the output, which forces the last hidden layer to learn shared hidden representations for faces of different people in order to classify all of them well, resulting in highly discriminative and compact features with good generalization ability.
A few deep models have been used for face verification or identification. Chopra et al. [10] used a Siamese network [4] for deep metric learning. The Siamese network extracts features separately from two compared inputs with two identical sub-networks, taking the distance between the outputs of the two sub-networks as dissimilarity. [10] used deep ConvNets as the sub-networks. In contrast to the Siamese network, in which feature extraction and recognition are jointly learned with the face verification target, we conduct feature extraction and recognition in two steps, with the first feature extraction step learned with the target of face identification, which is a much stronger supervision signal than verification. Huang et al. [18] generatively learned features with CDBNs [25], then used ITML [13] and linear SVM for face verification. Cai et al. [5] also learned deep metrics under the Siamese network framework as [10], but used a two-level ISA network [23] as the sub-networks instead. Zhu et al. [35, 36] learned deep neural networks to transform faces in arbitrary poses and illumination to frontal faces with normal illumination, and then used the last hidden layer features or the transformed faces for face recognition. Sun et al. [31] used multiple deep ConvNets to learn high-level face similarity features and trained a classification RBM [22] for face verification. Their features are jointly extracted from a pair of faces instead of from a single face.
3. Learning DeepID for face verification

3.1. Deep ConvNets
Our deep ConvNets contain four convolutional layers (with max-pooling) to extract features hierarchically, followed by the fully-connected DeepID layer and the softmax output layer indicating identity classes. The input is 39×31×k for rectangle patches and 31×31×k for square patches, where k = 3 for color patches and k = 1 for gray patches. Figure 2 shows the detailed structure of the ConvNet which takes 39×31×1 input and predicts n (e.g., n = 10,000) identity classes. When the input sizes change, the height and width of the maps in the following layers will change accordingly. The dimension of the DeepID layer is fixed to 160, while the dimension of the output layer varies according to the number of classes it predicts. Feature numbers continue to reduce along the feature extraction hierarchy until the last hidden layer (the DeepID layer), where highly compact and predictive features are formed, which predict a much larger number of identity classes with only a few features.

Figure 2. ConvNet structure. The length, width, and height of each cuboid denote the map number and the dimension of each map for all input, convolutional, and max-pooling layers. The inside small cuboids and squares denote the 3D convolution kernel sizes and the 2D pooling region sizes of convolutional and max-pooling layers, respectively. Neuron numbers of the last two fully-connected layers are marked beside each layer.
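As a concrete reference, a minimal PyTorch sketch of a ConvNet with this shape for a 39×31×1 input is given below. It is not the authors' implementation: the map numbers and kernel sizes are illustrative assumptions (the actual values are labeled in Figure 2), standard convolutions stand in for the locally shared weights of the higher layers, and the DeepID layer pools features from both the third (after max-pooling) and fourth convolutional layers as described later in this section.

```python
import torch
import torch.nn as nn

class DeepIDConvNet(nn.Module):
    """Sketch of the four-conv-layer ConvNet with a 160-d DeepID layer
    and an n-way softmax output (map counts/kernel sizes are assumed)."""
    def __init__(self, n_classes=10000):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(1, 20, 4), nn.ReLU(), nn.MaxPool2d(2))   # 39x31 -> 18x14
        self.conv2 = nn.Sequential(nn.Conv2d(20, 40, 3), nn.ReLU(), nn.MaxPool2d(2))  # -> 8x6
        self.conv3 = nn.Sequential(nn.Conv2d(40, 60, 3), nn.ReLU(), nn.MaxPool2d(2))  # -> 3x2
        self.conv4 = nn.Sequential(nn.Conv2d(60, 80, 2), nn.ReLU())                   # -> 2x1
        # DeepID layer sees both the pooled third-layer and the fourth-layer maps.
        self.deepid = nn.Linear(60 * 3 * 2 + 80 * 2 * 1, 160)
        self.out = nn.Linear(160, n_classes)   # n-way softmax (via CrossEntropyLoss)

    def forward(self, x):                      # x: (batch, 1, 39, 31)
        h3 = self.conv3(self.conv2(self.conv1(x)))
        h4 = self.conv4(h3)
        multi_scale = torch.cat([h3.flatten(1), h4.flatten(1)], dim=1)
        deepid = torch.relu(self.deepid(multi_scale))   # 160-d DeepID features
        return self.out(deepid), deepid
```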
The convolution operation is expressed as
\[
y_j(r) = \max\Big(0,\; b_j(r) + \sum_i k_{ij}(r) * x_i(r)\Big), \qquad (1)
\]
where $x_i$ and $y_j$ are the $i$-th input map and the $j$-th output map, respectively, $k_{ij}$ is the convolution kernel between the $i$-th input map and the $j$-th output map, $*$ denotes convolution, and $b_j$ is the bias of the $j$-th output map. We use ReLU nonlinearity ($y = \max(0, x)$) for hidden neurons, which is shown to have better fitting abilities than the sigmoid function [20]. Weights in higher convolutional layers of our ConvNets are locally shared to learn different mid- or high-level features in different regions [18]. $r$ in Equation 1 indicates a local region where weights are shared. In the third convolutional layer, weights are locally shared in every 2×2 region, while weights in the fourth convolutional layer are totally unshared. Max-pooling is formulated as
\[
y^i_{j,k} = \max_{0 \le m, n < s} \big\{ x^i_{j \cdot s + m,\, k \cdot s + n} \big\}, \qquad (2)
\]
where each neuron in the $i$-th output map $y^i$ pools over an $s \times s$ non-overlapping local region in the $i$-th input map $x^i$.
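A didactic numpy sketch of Equations (1) and (2) follows; the function names are our own, and the region indexing is a simplified interpretation of the locally shared weights (one kernel set per sharing region $r$).

```python
import numpy as np

def conv_relu(x, kernels, biases, regions=1):
    """Eq. (1): ReLU convolution with optionally locally shared weights.
    x: (C_in, H, W); kernels: (R, C_out, C_in, kh, kw) with one kernel set
    per sharing region (R = regions**2, e.g. regions=2 for the 2x2 sharing
    of the third layer, regions=1 for globally shared weights); biases: (R, C_out)."""
    R, c_out, c_in, kh, kw = kernels.shape
    H, W = x.shape[1] - kh + 1, x.shape[2] - kw + 1        # valid convolution
    y = np.zeros((c_out, H, W))
    for p in range(H):
        for q in range(W):
            r = (p * regions // H) * regions + (q * regions // W)   # sharing region index
            patch = x[:, p:p + kh, q:q + kw]
            y[:, p, q] = biases[r] + np.tensordot(kernels[r], patch, axes=3)
    return np.maximum(y, 0.0)                              # ReLU nonlinearity

def max_pool(x, s):
    """Eq. (2): non-overlapping s x s max-pooling applied per map."""
    c, h, w = x.shape
    return x[:, :h - h % s, :w - w % s].reshape(c, h // s, s, w // s, s).max(axis=(2, 4))
```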
Figure 3. Top: ten face regions of medium scales. The five regions in the top left are global regions taken from the weakly aligned faces; the other five in the top right are local regions centered around the five facial landmarks (two eye centers, nose tip, and two mouth corners). Bottom: three scales of two particular patches.
The last hidden layer of the DeepID is fully connected to both the third and fourth convolutional layers (after max-pooling) such that it sees multi-scale features [28] (features in the fourth convolutional layer are more global than those in the third one). This is critical to feature learning because, after successive down-sampling along the cascade, the fourth convolutional layer contains too few neurons and becomes the bottleneck for information propagation. Adding the bypassing connections between the third convolutional layer (referred to as the skipping layer) and the last hidden layer reduces the possible information loss in the fourth convolutional layer. The last hidden layer takes the function
\[
y_j = \max\Big(0,\; \sum_i x^1_i \cdot w^1_{i,j} + \sum_i x^2_i \cdot w^2_{i,j} + b_j\Big), \qquad (3)
\]
where $x^1, w^1$ and $x^2, w^2$ denote the neurons and weights in the third and fourth convolutional layers, respectively. It linearly combines the features in the previous two convolutional layers, followed by ReLU non-linearity.
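Equation (3) amounts to a single linear map from the flattened third- and fourth-layer features followed by ReLU; a one-function numpy sketch (with assumed array shapes):

```python
import numpy as np

def deepid_layer(x3, x4, w3, w4, b):
    """Eq. (3): the 160-d DeepID layer linearly combines the flattened
    skipping-layer (third) and fourth-layer features, then applies ReLU.
    x3: (d3,), x4: (d4,), w3: (d3, 160), w4: (d4, 160), b: (160,)."""
    return np.maximum(x3 @ w3 + x4 @ w4 + b, 0.0)
```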
The ConvNet output is an n-way softmax predicting the probability distribution over n different identities,
\[
y_i = \frac{\exp(y'_i)}{\sum_{j=1}^{n} \exp(y'_j)}, \qquad (4)
\]
where $y'_j = \sum_{i=1}^{160} x_i \cdot w_{i,j} + b_j$ linearly combines the 160 DeepID features $x_i$ as the input of neuron $j$, and $y_j$ is its output. The ConvNet is learned by minimizing $-\log y_t$, where $t$ is the target class. Stochastic gradient descent is used, with gradients calculated by back-propagation.
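A small numpy sketch of the softmax output in Equation (4) and the $-\log y_t$ identification loss; the function names and the max-subtraction for numerical stability are our additions.

```python
import numpy as np

def softmax_output(deepid, w, b):
    """Eq. (4): n-way softmax over identities. deepid: (160,), w: (160, n), b: (n,)."""
    logits = deepid @ w + b          # y'_j = sum_i x_i * w_{i,j} + b_j
    logits -= logits.max()           # subtract max for numerical stability
    e = np.exp(logits)
    return e / e.sum()

def identification_loss(deepid, w, b, t):
    """Training objective -log y_t for target identity t (minimized by SGD)."""
    return -np.log(softmax_output(deepid, w, b)[t])
```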
3.2. Feature extraction

We detect five facial landmarks, including the two eye centers, the nose tip, and the two mouth corners, with the facial point detection method proposed by Sun et al. [30]. Faces are globally aligned by similarity transformation according to the two eye centers and the mid-point of the two mouth corners. Features are extracted from 60 face patches with ten regions, three scales, and RGB or gray channels. Figure 3 shows the ten face regions and the three scales of two particular face regions. We trained 60 ConvNets, each of which extracts two 160-dimensional DeepID vectors from a particular patch and its horizontally flipped counterpart. A special case is the patches around the two eye centers and the two mouth corners, which are not flipped themselves, but use the patches symmetric with them (for example, the flipped counterpart of the patch centered on the left eye is derived by flipping the patch centered on the right eye). The total length of the DeepID is 19,200 (160 × 2 × 60), which is ready for the final face verification.
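A sketch of this extraction pipeline is given below; `patch_croppers` and `convnets` are hypothetical placeholders for the 60 cropping functions and the 60 trained ConvNets, and the special symmetric handling of the eye- and mouth-corner patches is omitted for brevity.

```python
import numpy as np

def extract_deepid(face, patch_croppers, convnets):
    """Concatenate the DeepID from 60 patches and their horizontal flips.
    Each ConvNet returns a 160-d vector, so the result is 160 * 2 * 60 = 19,200-d.
    (Eye-/mouth-corner patches, which use their symmetric patch as the flipped
    counterpart, are not treated specially in this simplified sketch.)"""
    feats = []
    for crop, net in zip(patch_croppers, convnets):
        patch = crop(face)
        feats.append(net(patch))                      # 160-d DeepID of the patch
        feats.append(net(np.flip(patch, axis=-1)))    # 160-d DeepID of the flipped patch
    return np.concatenate(feats)                      # 19,200-d DeepID
```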
3.3. Face verification

We use the Joint Bayesian [8] technique for face verification based on the DeepID. Joint Bayesian has been highly successful for face verification [9, 6]. It represents the extracted facial features x (after subtracting the mean) as the sum of two independent Gaussian variables,
\[
x = \mu + \epsilon, \qquad (5)
\]
where $\mu \sim N(0, S_\mu)$ represents the face identity and $\epsilon \sim N(0, S_\epsilon)$ the intra-personal variations. Joint Bayesian models the joint probability of two faces given the intra- or extra-personal variation hypothesis, $P(x_1, x_2 \mid H_I)$ and $P(x_1, x_2 \mid H_E)$. It is readily shown from Equation 5 that these two probabilities are also Gaussian, with covariances
\[
\Sigma_I = \begin{bmatrix} S_\mu + S_\epsilon & S_\mu \\ S_\mu & S_\mu + S_\epsilon \end{bmatrix} \qquad (6)
\]
and
\[
\Sigma_E = \begin{bmatrix} S_\mu + S_\epsilon & 0 \\ 0 & S_\mu + S_\epsilon \end{bmatrix}, \qquad (7)
\]
respectively. $S_\mu$ and $S_\epsilon$ can be learned from data with the EM algorithm. In test, it calculates the likelihood ratio
\[
r(x_1, x_2) = \log \frac{P(x_1, x_2 \mid H_I)}{P(x_1, x_2 \mid H_E)}, \qquad (8)
\]
which has closed-form solutions and is efficient.
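The likelihood ratio in Equation (8) can be evaluated directly from the two block covariances in Equations (6) and (7); the sketch below does exactly that with scipy (the paper relies on an equivalent closed form), assuming mean-subtracted features and $S_\mu$, $S_\epsilon$ already estimated by EM.

```python
import numpy as np
from scipy.stats import multivariate_normal

def joint_bayesian_ratio(x1, x2, S_mu, S_eps):
    """Eq. (8): log P(x1, x2 | H_I) - log P(x1, x2 | H_E), using the
    block covariances of Eq. (6) and (7). Didactic sketch, not optimized."""
    d = len(x1)
    x = np.concatenate([x1, x2])
    sigma_I = np.block([[S_mu + S_eps, S_mu],
                        [S_mu,         S_mu + S_eps]])
    sigma_E = np.block([[S_mu + S_eps, np.zeros((d, d))],
                        [np.zeros((d, d)), S_mu + S_eps]])
    zero = np.zeros(2 * d)
    return (multivariate_normal.logpdf(x, mean=zero, cov=sigma_I)
            - multivariate_normal.logpdf(x, mean=zero, cov=sigma_E))

# toy usage with identity covariances standing in for the EM estimates
d = 150
print(joint_bayesian_ratio(np.ones(d), np.ones(d), np.eye(d), np.eye(d)))
```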
We also train a neural network for verification and compare it to Joint Bayesian, to see whether other models can also learn from the extracted features and how much the features and a good face verification model contribute to the performance, respectively. The neural network contains one input layer taking the DeepID, one locally-connected layer, one fully-connected layer, and a single output neuron indicating face similarity. The input features are divided into 60 groups, each of which contains 640 features extracted from a particular patch pair with a particular ConvNet. Features in the same group are highly correlated. Neurons in the locally-connected layer only connect to a single group of features to learn their local relations and reduce the feature dimension at the same time. The second hidden layer is fully connected to the first hidden layer to learn global relations. The single output neuron is fully connected to the second hidden layer. The hidden neurons are ReLUs and the output neuron is a sigmoid. An illustration of the neural network structure is shown in Figure 4. It has 38,400 input neurons, taking the 19,200-dimensional DeepID features from each face of the compared pair, and 4,800 neurons in each of the following two hidden layers, with every 80 neurons in the first hidden layer locally connected to one of the 60 groups of input neurons.

Figure 4. The structure of the neural network used for face verification. The layer type and dimension are labeled beside each layer. The solid neurons form a subnetwork.
Dropout learning [16] is used for all the hidden neurons. The input neurons cannot be dropped because the learned features are compact and distributed representations (representing a large number of identities with very few neurons) and have to collaborate with each other to represent the identities well. On the other hand, learning high-dimensional features without dropout is difficult due to gradient diffusion. To solve this problem, we first train 60 subnetworks, each with the features of a single group as input. A particular subnetwork is illustrated in Figure 4. We then use the first-layer weights of the subnetworks to initialize those of the original network, and tune the second and third layers of the original network with the first-layer weights fixed.
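A PyTorch sketch of the verification network described above, using the group sizes stated in the text (60 groups of 640 features, 80 locally connected units per group, a 4,800-unit fully connected layer, and a sigmoid output); the dropout rate is an assumption, and the subnetwork pretraining with fixed first-layer weights is not shown.

```python
import torch
import torch.nn as nn

class VerificationNet(nn.Module):
    """Locally-connected layer implemented as one small linear map per
    feature group, followed by a fully-connected layer and a sigmoid output."""
    def __init__(self, n_groups=60, group_dim=640, group_out=80, p_drop=0.5):
        super().__init__()
        self.local = nn.ModuleList(
            [nn.Linear(group_dim, group_out) for _ in range(n_groups)])
        hidden = n_groups * group_out                  # 60 * 80 = 4,800
        self.fc = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, 1)
        self.drop = nn.Dropout(p_drop)                 # dropout on hidden neurons only

    def forward(self, x):                              # x: (batch, 60, 640) grouped pair features
        h1 = torch.cat([torch.relu(layer(x[:, g]))
                        for g, layer in enumerate(self.local)], dim=1)
        h2 = torch.relu(self.fc(self.drop(h1)))
        return torch.sigmoid(self.out(self.drop(h2)))  # predicted face similarity
```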
4. Experiments

We evaluate our algorithm on LFW, which reveals the state of the art of face verification in the wild. Though LFW contains 5749 people, only 85 have more than 15 images, and 4069 people have only one image. It is inadequate to train identity classifiers with so few images per person. Instead, we trained our model on CelebFaces [31] and tested on LFW (Sections 4.1-4.3). CelebFaces contains 87,628 face images of 5436 celebrities from the Internet, with approximately 16 images per person on average. People in LFW and CelebFaces are mutually exclusive.

We randomly choose 80% (4349) of the people from CelebFaces to learn the DeepID, and use the remaining 20% of the people to learn the face verification model (Joint Bayesian or neural networks). For feature learning, the ConvNets are supervised to classify the 4349 people simultaneously from a particular kind of face patch and its flipped counterpart. We randomly select 10% of the images of each training person to generate the validation data. After each training epoch, we observe the top-1 validation set error rate and select the model that provides the lowest one.
In face verification, the feature dimension is reduced to 150 by PCA before learning the Joint Bayesian model. Performance remains almost unchanged over a wide range of dimensions. In test, each face pair is classified by comparing the Joint Bayesian likelihood ratio to a threshold optimized on the training data.
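A minimal sketch of this test-time pipeline, reusing the joint_bayesian_ratio sketch from Section 3.3 and substituting synthetic placeholders for the DeepID features, the EM-estimated covariances, and the tuned threshold.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
train_feats = rng.normal(size=(500, 19200))    # placeholder for training DeepID vectors

pca = PCA(n_components=150).fit(train_feats)   # reduce the 19,200-d DeepID to 150-d
S_mu, S_eps = np.eye(150), np.eye(150)         # placeholders for the EM estimates
threshold = 0.0                                # tuned on training pairs in practice

def same_identity(x1, x2):
    """Accept a pair when the Joint Bayesian log-likelihood ratio
    (joint_bayesian_ratio, defined in the earlier sketch) exceeds the threshold."""
    z1, z2 = pca.transform(np.stack([x1, x2]))
    return joint_bayesian_ratio(z1, z2, S_mu, S_eps) > threshold
```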
To evaluate the performance of our approach at an even larger training scale in Section 4.4, we extend CelebFaces to the CelebFaces+ dataset, which contains 202,599 face images of 10,177 celebrities. Again, people in LFW and CelebFaces+ are mutually exclusive. The ConvNet structure and feature extraction process described in the previous section remain unchanged.
4.1. Multi-scale ConvNets
We verify the effectiveness of directly connecting the neurons in the third convolutional layer (after max-pooling) to the last hidden layer (the DeepID layer), such that it sees both the third and fourth convolutional layer features, forming the so-called multi-scale ConvNets. It also reduces the feature numbers from the convolutional layers to the DeepID layer (shown in Figure 1), which helps the latter to learn higher-level features in order to represent the face identities well with fewer neurons. Figure 5 compares the top-1 validation set error rates of the 60 ConvNets learned to classify the 4349 classes of identities, either with or without the skipping layer. Lower error rates indicate better learned hidden features. Allowing the DeepID to pool over multi-scale features reduces validation errors by an average of 4.72%. It also improves the final face verification accuracy from 95.35% to 96.05% when concatenating the DeepID from the 60 ConvNets and using Joint Bayesian for face verification.

Figure 5. Top-1 validation set error rates of the 60 ConvNets trained on the 60 different patches. The blue and red markers show the error rates of the conventional ConvNets (without the skipping layer) and the multi-scale ConvNets, respectively.
4.2. Learning effective features

Classifying a large number of identities simultaneously is key to learning discriminative and compact hidden features. To verify this, we increase the identity classes for training exponentially (and the output neuron numbers correspondingly) from 136 to 4349, while fixing the neuron numbers in all previous layers (the DeepID is kept 160-dimensional). We observe the classification ability of the ConvNets (measured by the top-1 validation set error rates) and the effectiveness of the learned hidden representations for face verification (measured by the test set verification accuracy) as the identity classes increase. The input is a single patch covering the whole face in this experiment. As shown in Figure 6, both Joint Bayesian and the neural network improve linearly in verification accuracy when the identity classes double. The improvement is significant. When the identity classes increase 32 times from 136 to 4349, the accuracy increases by 10.13% and 8.42% for Joint Bayesian and the neural network, respectively, or 2.03% and 1.68% on average whenever the identity classes double. At the same time, the validation set error rates drop, even when the predicted classes are tens of times more numerous than the last hidden layer neurons, as shown in Figure 7. This phenomenon indicates that ConvNets can learn from classifying each identity and form shared hidden representations that can classify all the identities well. More identity classes help to learn better hidden representations that can distinguish more people (discriminative) without increasing the feature length (compact). The linear increase of test accuracy with respect to the exponentially increasing training data indicates that our features would be further improved if even more identities were available. Examples of the 160-dimensional DeepID learned from the 4349 training identities and extracted from LFW test pairs are shown in Figure 8. We find that faces of the same identity tend to have more commonly activated neurons (positive features in the same positions) than those of different identities. So the learned features extract identity information.

We also test the 4349-dimensional classifier outputs as features for face verification. Joint Bayesian only achieves approximately 66% accuracy on these features, while the neural network fails, where it accounts all the face pairs as