Paper Reading Notes 5: FaceNet: A Unified Embedding for Face Recognition and Clustering
Authors and Publication
Authors
Florian Schroff / Google Inc.
Dmitry Kalenichenko / Google Inc.
James Philbin / Google Inc.
Bibtex
Schroff F, Kalenichenko D, Philbin J. FaceNet: A unified embedding for face recognition and clustering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 815-823.
Categories
Computer Graphics, Deep Learning, Object Detection
0. Abstract
Despite significant recent advances in the field of face recognition [10, 14, 15, 17], implementing face verification and recognition efficiently at scale presents serious challenges to current approaches. In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.
Despite major recent progress in the field of face recognition [10, 14, 15, 17], carrying out face verification and recognition efficiently at scale still poses serious challenges for current methods. In this paper we present a system called FaceNet, which directly learns a mapping from face images to a compact Euclidean space in which distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can easily be implemented with standard techniques, using FaceNet embeddings as feature vectors.
Our method uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches. To train, we use triplets of roughly aligned matching / non-matching face patches generated using a novel online triplet mining method. The benefit of our approach is much greater representational efficiency: we achieve state-of-the-art face recognition performance using only 128-bytes per face.
We use a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches. For training, we use triplets of roughly aligned matching / non-matching face patches generated with a novel online triplet mining method. The benefit of this approach is much greater representational efficiency: we achieve state-of-the-art face recognition performance using only 128 bytes per face.
On the widely used Labeled Faces in the Wild (LFW) dataset, our system achieves a new record accuracy of 99.63%. On YouTube Faces DB it achieves 95.12%. Our system cuts the error rate in comparison to the best published result [15] by 30% on both datasets.
On the widely used Labeled Faces in the Wild (LFW) dataset, our system achieves a new record accuracy of 99.63%; on YouTube Faces DB it achieves 95.12%. Compared with the best published result [15], our system reduces the error rate by 30% on both datasets.
1. Introduction
In this paper we present a unified system for face verification (is this the same person), recognition (who is this person) and clustering (find common people among these faces). Our method is based on learning a Euclidean embedding per image using a deep convolutional network. The network is trained such that the squared L2 distances in the embedding space directly correspond to face similarity: faces of the same person have small distances and faces of distinct people have large distances.
In this paper we present a unified system for face verification (is this the same person), recognition (who is this person) and clustering (finding the people that a set of faces has in common). Our method learns a Euclidean embedding per image with a deep convolutional network. The network is trained so that the squared L2 distances in the embedding space directly correspond to face similarity: faces of the same person have small distances, while faces of different people have large distances.
Once this embedding has been produced, then the aforementioned tasks become straight-forward: face verification simply involves thresholding the distance between the two embeddings; recognition becomes a KNN classification problem; and clustering can be achieved using off-the-shelf techniques such as k-means or agglomerative clustering.
Once this embedding has been produced, the aforementioned tasks become straightforward: face verification simply thresholds the distance between two embeddings; recognition becomes a KNN classification problem; and clustering can be achieved with off-the-shelf techniques such as k-means or agglomerative clustering.
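To make the "straightforward" part concrete, here is a minimal sketch of how the three tasks could sit on top of precomputed embeddings. It assumes 128-D, L2-normalized numpy rows; the gallery, the 1.1 verification threshold, and the cluster count are illustrative placeholders, not values from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Stand-in for embeddings produced by a FaceNet-style network:
# 128-D, L2-normalized row vectors with known identities (the gallery).
gallery = rng.normal(size=(100, 128)).astype(np.float32)
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
gallery_ids = rng.integers(0, 10, size=100)

# A probe embedding to be verified / recognized.
probe = gallery[0] + 0.05 * rng.normal(size=128).astype(np.float32)
probe /= np.linalg.norm(probe)

# Verification: threshold the squared L2 distance between two embeddings.
THRESHOLD = 1.1  # illustrative value, not the paper's tuned threshold
same_person = float(np.sum((probe - gallery[0]) ** 2)) < THRESHOLD

# Recognition: a K-NN classifier over the gallery embeddings.
knn = KNeighborsClassifier(n_neighbors=5).fit(gallery, gallery_ids)
predicted_id = knn.predict(probe[None, :])[0]

# Clustering: off-the-shelf k-means directly on the embeddings.
cluster_labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(gallery)

print(same_person, predicted_id, np.bincount(cluster_labels))
```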
Previous face recognition approaches based on deep networks use a classification layer [15, 17] trained over a set of known face identities and then take an intermediate bottleneck layer as a representation used to generalize recognition beyond the set of identities used in training. The downsides of this approach are its indirectness and its inefficiency: one has to hope that the bottleneck representation generalizes well to new faces; and by using a bottleneck layer the representation size per face is usually very large (1000s of dimensions). Some recent work [15] has reduced this dimensionality using PCA, but this is a linear transformation that can be easily learnt in one layer of the network.
Previous deep-network-based face recognition approaches use a classification layer [15, 17] trained over a set of known face identities, and then take an intermediate bottleneck layer as the representation used to generalize recognition beyond the identities seen in training. The downsides of this approach are its indirectness and inefficiency: one has to hope that the bottleneck representation generalizes well to new faces, and because a bottleneck layer is used, the per-face representation is usually very large (thousands of dimensions). Some recent work [15] reduces this dimensionality with PCA, but PCA is a linear transformation that could easily be learned in a single layer of the network.
In contrast to these approaches, FaceNet directly trains its output to be a compact 128-D embedding using a triplet-based loss function based on LMNN [19]. Our triplets consist of two matching face thumbnails and a non-matching face thumbnail, and the loss aims to separate the positive pair from the negative by a distance margin. The thumbnails are tight crops of the face area; no 2D or 3D alignment, other than scale and translation, is performed.
In contrast to these approaches, FaceNet directly trains its output to be a compact 128-D embedding using a triplet-based loss function based on LMNN [19]. Our triplets consist of two matching face thumbnails and one non-matching face thumbnail, and the loss aims to separate the positive pair from the negative by a distance margin. The thumbnails are tight crops of the face region; apart from scaling and translation, no 2D or 3D alignment is performed.
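The formal definition is deferred to Section 3.1 of the paper, but the idea sketched above can be written down directly: for a triplet (anchor, positive, negative), the loss penalizes cases where the anchor-positive squared distance is not smaller than the anchor-negative squared distance by at least a margin. The numpy version below is a minimal illustration; the margin value and the assumption of L2-normalized embeddings are stated here for concreteness rather than taken from the paper's exact settings.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss over batches of (assumed L2-normalized) embeddings.

    Encourages ||a - p||^2 + margin < ||a - n||^2 for every triplet; triplets
    that already satisfy the margin contribute zero loss.
    """
    pos_dist = np.sum((anchor - positive) ** 2, axis=-1)
    neg_dist = np.sum((anchor - negative) ** 2, axis=-1)
    return float(np.mean(np.maximum(pos_dist - neg_dist + margin, 0.0)))
```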
Choosing which triplets to use turns out to be very important for achieving good performance and, inspired by curriculum learning [1], we present a novel online negative exemplar mining strategy which ensures consistently increasing difficulty of triplets as the network trains. To improve clustering accuracy, we also explore hard-positive mining techniques which encourage spherical clusters for the embeddings of a single person.
Choosing which triplets to use turns out to be very important for achieving good performance, and, inspired by curriculum learning [1], we present a novel online negative exemplar mining strategy that ensures the difficulty of the triplets keeps increasing as the network trains. To improve clustering accuracy, we also explore hard-positive mining techniques, which encourage spherical clusters for the embeddings of a single person.
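One plausible reading of "online negative exemplar mining" is the mini-batch selection rule the paper spells out in Section 3.2: for each anchor-positive pair, prefer negatives that are already farther from the anchor than the positive but still violate the margin (so-called semi-hard negatives). The sketch below is an illustration of that rule under stated assumptions, not the paper's exact implementation; candidate_negatives is assumed to hold only embeddings of other identities from the current mini-batch.

```python
import numpy as np

def pick_semi_hard_negative(anchor, positive, candidate_negatives, margin=0.2):
    """Return the index of a 'semi-hard' negative for one (anchor, positive) pair.

    Semi-hard: farther from the anchor than the positive is, but still inside
    the margin, i.e. pos_dist < neg_dist < pos_dist + margin. If no candidate
    qualifies, fall back to the hardest (closest) negative in the batch.
    """
    pos_dist = np.sum((anchor - positive) ** 2)
    neg_dists = np.sum((candidate_negatives - anchor) ** 2, axis=1)

    semi_hard = np.where((neg_dists > pos_dist) & (neg_dists < pos_dist + margin))[0]
    if semi_hard.size > 0:
        # Among the semi-hard candidates, take the hardest one (smallest distance).
        return int(semi_hard[np.argmin(neg_dists[semi_hard])])
    return int(np.argmin(neg_dists))
```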
As an illustration of the incredible variability that our method can handle, see Figure 1. Shown are image pairs from PIE [13] that previously were considered to be very difficult for face verification systems.
For an illustration of the incredible variability our method can handle, see Figure 1. Shown are image pairs from PIE [13] that were previously considered very difficult for face verification systems.
An overview of the rest of the paper is as follows: in section 2 we review the literature in this area; section 3.1 defines the triplet loss and section 3.2 describes our novel triplet selection and training procedure; in section 3.3 we describe the model architecture used. Finally in sections 4 and 5 we present some quantitative results of our embeddings and also qualitatively explore some clustering results.
The rest of the paper is organized as follows: Section 2 reviews the literature in this area; Section 3.1 defines the triplet loss and Section 3.2 describes our novel triplet selection and training procedure; Section 3.3 describes the model architecture used. Finally, Sections 4 and 5 present some quantitative results for our embeddings and qualitatively explore some clustering results.
2. Related Work
Similarly to other recent works which employ deep networks [15, 17], our approach is a purely data driven method which learns its representation directly from the pixels of the face. Rather than using engineered features, we use a large dataset of labelled faces to attain the appropriate invariances to pose, illumination, and other variational conditions.
Similarly to other recent work that employs deep networks [15, 17], our approach is a purely data-driven method that learns its representation directly from the pixels of the face. Rather than using engineered features, we use a large dataset of labelled faces to attain the appropriate invariances to pose, illumination, and other variational conditions.
In this paper we explore two different deep network architectures that have been recently used to great success in the computer vision community. Both are deep convolutional networks [8, 11]. The first architecture is based on the Zeiler & Fergus [22] model which consists of multiple interleaved layers of convolutions, non-linear activations, local response normalizations, and max pooling layers. We additionally add several 1×1×d convolution layers inspired by the work of [9]. The second architecture is based on the Inception model of Szegedy et al. which was recently used as the winning approach for ImageNet 2014 [16]. These networks use mixed layers that run several different convolutional and pooling layers in parallel and concatenate their responses. We have found that these models can reduce the number of parameters by up to 20 times and have the potential to reduce the number of FLOPS required for comparable performance.
In this paper we explore two different deep network architectures that have recently been used to great success in the computer vision community. Both are deep convolutional networks [8, 11]. The first architecture is based on the Zeiler & Fergus [22] model, which consists of multiple interleaved layers of convolutions, non-linear activations, local response normalizations, and max-pooling layers. We additionally add several 1×1×d convolution layers inspired by the work of [9]. The second architecture is based on the Inception model of Szegedy et al., which was recently used as the winning approach for ImageNet 2014 [16]. These networks use mixed layers that run several different convolutional and pooling layers in parallel and concatenate their responses. We have found that these models can reduce the number of parameters by up to 20 times and have the potential to reduce the number of FLOPS required for comparable performance.
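As a rough illustration of what such a "mixed" layer looks like, the PyTorch module below runs 1×1, 3×3 and 5×5 convolution branches and a pooling branch in parallel and concatenates their responses along the channel axis, with 1×1 convolutions used to reduce dimensionality before the larger filters. The branch widths are illustrative placeholders, not the Inception/NN2 configuration reported in the paper.

```python
import torch
import torch.nn as nn

class MixedLayer(nn.Module):
    """Inception-style block: parallel conv/pool branches, concatenated outputs."""

    def __init__(self, in_ch):
        super().__init__()
        self.branch1x1 = nn.Conv2d(in_ch, 64, kernel_size=1)
        self.branch3x3 = nn.Sequential(
            nn.Conv2d(in_ch, 96, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(96, 128, kernel_size=3, padding=1))
        self.branch5x5 = nn.Sequential(
            nn.Conv2d(in_ch, 16, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, kernel_size=5, padding=2))
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1))

    def forward(self, x):
        return torch.cat([self.branch1x1(x), self.branch3x3(x),
                          self.branch5x5(x), self.branch_pool(x)], dim=1)

# Example: a 192-channel feature map becomes 64 + 128 + 32 + 32 = 256 channels.
out = MixedLayer(192)(torch.randn(1, 192, 28, 28))
```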
There is a vast corpus of face verification and recognition works. Reviewing it is out of the scope of this paper so we will only briefly discuss the most relevant recent work.
There is a vast body of work on face verification and recognition. Reviewing it is beyond the scope of this paper, so we only briefly discuss the most relevant recent work.
The works of [15, 17, 23] all employ a complex system of multiple stages, that combines the output of a deep convolutional network with PCA for dimensionality reduction and an SVM for classification.
The works of [15, 17, 23] all employ a complex multi-stage system that combines the output of a deep convolutional network with PCA for dimensionality reduction and an SVM for classification.
Zhenyao et al. [23] employ a deep network to “warp” faces into a canonical frontal view and then learn a CNN that classifies each face as belonging to a known identity. For face verification, PCA on the network output in conjunction with an ensemble of SVMs is used.
Zhenyao et al. [23] employ a deep network to “warp” faces into a canonical frontal view and then learn a CNN that classifies each face as belonging to a known identity. For face verification, PCA on the network output is used in conjunction with an ensemble of SVMs.
Taigman et al. [17] propose a multi-stage approach that aligns faces to a general 3D shape model. A multi-class network is trained to perform the face recognition task on over four thousand identities. The authors also experimented with a so-called Siamese network where they directly optimize the L1-distance between two face features. Their best performance on LFW (97.35%) stems from an ensemble of three networks using different alignments and color channels. The predicted distances (non-linear SVM predictions based on the χ² kernel) of those networks are combined using a non-linear SVM.
Taigman et al. [17] propose a multi-stage approach that aligns faces to a general 3D shape model. A multi-class network is trained to perform face recognition on more than four thousand identities. The authors also experimented with a so-called Siamese network in which the L1 distance between two face features is optimized directly. Their best performance on LFW (97.35%) comes from an ensemble of three networks using different alignments and color channels. The predicted distances of those networks (non-linear SVM predictions based on the χ² kernel) are combined using a non-linear SVM.
Sun et al. [14, 15] propose a compact and therefore relatively cheap to compute network. They use an ensemble of 25 of these networks, each operating on a different face patch. For their final performance on LFW (99.47% [15]) the authors combine 50 responses (regular and flipped). Both PCA and a Joint Bayesian model [2] that effectively correspond to a linear transform in the embedding space are employed. Their method does not require explicit 2D/3D alignment. The networks are trained by using a combination of classification and verification loss. The verification loss is similar to the triplet loss we employ [12, 19], in that it minimizes the L2-distance between faces of the same identity and enforces a margin between the distance of faces of different identities. The main difference is that only pairs of images are compared, whereas the triplet loss encourages a relative distance constraint.
Sun et al. [14, 15] propose a compact and therefore relatively cheap to compute network. They use an ensemble of 25 of these networks, each operating on a different face patch. For their final performance on LFW (99.47% [15]), the authors combine 50 responses (regular and flipped). Both PCA and a Joint Bayesian model [2], which effectively correspond to a linear transform in the embedding space, are employed. Their method does not require explicit 2D/3D alignment. The networks are trained with a combination of classification and verification loss. The verification loss is similar to the triplet loss we employ [12, 19], in that it minimizes the L2 distance between faces of the same identity and enforces a margin between the distances of faces of different identities. The main difference is that only pairs of images are compared, whereas the triplet loss encourages a relative distance constraint.
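To make that last distinction concrete: a pairwise verification loss of the kind described for [14, 15] pushes same-identity pairs together and different-identity pairs apart by an absolute margin, while the triplet loss only constrains distances relative to a shared anchor. The sketch below is a generic contrastive-style pair loss written for comparison, not the exact loss used in those papers; the margin value is illustrative.

```python
import numpy as np

def pair_verification_loss(x1, x2, same_identity, margin=1.0):
    """Contrastive-style pair loss: for matching pairs minimize the squared L2
    distance, for non-matching pairs penalize distances smaller than the margin."""
    d = np.sum((x1 - x2) ** 2, axis=-1)
    return float(np.mean(np.where(same_identity, d, np.maximum(margin - d, 0.0))))

# The triplet loss instead compares two distances that share the same anchor:
#     max(0, ||a - p||^2 - ||a - n||^2 + alpha)
# so only the *relative* ordering of positive and negative distances is constrained.
```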
A similar loss to the one used here was explored in Wang et al. [18] for ranking images by semantic and visual similarity.
A loss similar to the one used here was explored by Wang et al. [18] for ranking images by semantic and visual similarity.