Low-Resolution Convolutional Neural Networks for Video Face Recognition

Christian Herrmann¹,², Dieter Willersinn² and Jürgen Beyerer²,¹

¹ KIT, Adenauerring 4, 76131 Karlsruhe, Germany

² Fraunhofer IOSB, Fraunhoferstrasse 1, 76131 Karlsruhe, Germany

{christian.herrmann|dieter.willersinn|juergen.beyerer}@iosb.fraunhofer.de

Abstract

Security and safety applications such as surveillance or forensics demand face recognition in low-resolution video data. We propose a face recognition method based on a Convolutional Neural Network (CNN) with a manifold-based track comparison strategy for low-resolution video face recognition. The low-resolution domain is addressed by adjusting the network architecture to prevent bottlenecks or significant upscaling of face images. The CNN is trained with a combination of a large-scale self-collected video face dataset and large-scale public image face datasets, resulting in about 1.4M training images. To handle large amounts of video data and for effective comparison, the CNN face descriptors are compared efficiently on track level by local patch means. Our setup achieves 80.3 percent accuracy on a 32×32 pixels low-resolution version of the YouTube Faces Database and outperforms local image descriptors as well as the state-of-the-art VGG-Face network [20] in this domain. The superior performance of the proposed method is confirmed on a self-collected in-the-wild surveillance dataset.

1. Introduction

Searching for persons in video data is a common task in surveillance and forensic scenarios. Automatic face recognition methods can support this tedious task and consequently help to prevent or solve crimes in a fast and efficient manner. This scenario brings a few additional challenges compared to regular face recognition problems in terms of data quality. It is necessary to handle video data, which usually lacks in quality compared to single images. Recently, CNNs have proven very effective in the area of high-resolution (HR) face recognition [20, 21, 25]. In this paper, we transfer this success to low-resolution (LR) video face recognition by proposing an appropriate network architecture and training the network on a mixture of public image datasets and a self-collected video dataset.


Figure 1. Proposed face recognition strategy including a CNN face descriptor and a manifold-derived track distance (LMM).

Figure 1 shows our proposed system, consisting of a CNN to extract face image descriptors and a strategy to match variably sized sets of descriptors from different face tracks. Our focus for the track matching strategy is efficiency, because related literature employs an effective but inefficient many-to-many comparison [20, 21, 25]. Instead, we reduce the necessary comparisons by defining a fixed number of local patches in the face descriptor set and show that low numbers of patches are sufficient for a superior comparison result. The benefit of our face recognition system is demonstrated on a LR version of the public YouTube Faces Database (YTF) and a self-collected surveillance dataset. Our contributions are three-fold:

• A variation and adaption of existing CNN architectures to the LR face recognition domain, including a systematic architecture optimization and a novel loss function in this context.

• The collection of a large labeled video face dataset to train the proposed CNN.

• Proposing an appropriate matching strategy for CNN face descriptors for video data.


2. Related work

Trying to address LR video data mainly involves two specific challenges. First, the lack of spatial data, including resolution and image degradations such as out-of-focus, motion blur or compression artifacts. Second, one requires a strategy to compare two sets of still image face descriptors originating from two different face tracks.

Image face descriptors. The first popular face recognition techniques, such as Eigenfaces [27] or Fisherfaces [4], were holistic representations of the face. Over time, they were superseded by descriptors based on local features such as local binary patterns (LBP) [1] or dense SIFT [22]. Recently, these local descriptors appear to be replaced by deep CNNs, which have proven very effective for HR single image face recognition [20, 21, 24, 25]. Usually, the network architecture is inspired [21] by or directly derived [20] from a network designed for the ImageNet challenge [7]. Consequently, the face image is required to have the same or a similar resolution as the ImageNet data, which is in most cases scaled to 224×224 or 256×256 pixels. Image data from personal snapshots or professional footage usually includes at least this face size, making it a feasible strategy. However, video data, and surveillance footage in particular, lacks in resolution and would require upscaling to serve as input of such a network. Thus, some of the finer details a HR network might be focusing on for the recognition are unavailable. Addressing the LR challenges has received little attention so far when designing and training a CNN, and results were mediocre for the few exceptions [8, 15]. Notably, Schroff et al. [21] reported a drop from 86.4% to 37.8% in validation rate when the face size was reduced from 256×256 pixels to 40×40 pixels.

Video face recognition. While a number of elegant approaches based on cumulative descriptors such as bags of visual words or Fisher vectors [12, 19] were recently proposed to handle the face track comparison problem, these approaches are impractical for CNN descriptors, because CNNs yield holistic face image descriptors instead of the local image features required by these methods. This limits the comparison strategies to the remaining classical categories: set-based, space-based, and manifold-based. Set-based strategies select single (best-shot), random [25], specific [32] or all [6] elements of a set for a pair-wise comparison. Based on the pair-wise distances, popular choices for the set distance are the minimum or the Hausdorff distance [6]. Space-based methods model the face descriptor space of one track by a linear model, for example by an affine subspace or the convex hull [5, 9]. In the Mutual Subspace Method (MSM) [9], comparison is then performed by measuring the principal angle between the subspaces. Manifold-based methods [2, 17, 29] choose a more complex non-linear manifold instead of a linear model for descriptor space representation.

Table 1. Basic structure of the proposed low-resolution CNN. Kernel size given as rows×cols×#filters for conv and rows×cols for pool layers.

| layer group | type   | kernel size | stride, pad | data size out (32×32 input) | #params (32×32×1 input) |
|-------------|--------|-------------|-------------|------------------------------|--------------------------|
| 1           | conv   | 3×3×C       | 1, 1        | 32×32×C                      | 9C                       |
|             | relu   |             |             | 32×32×C                      |                          |
| 2           | conv   | 3×3×C       | 1, 1        | 32×32×C                      | 9C²                      |
|             | relu   |             |             | 32×32×C                      |                          |
|             | pool   | P×P         | 2, 0        | 16×16×C                      |                          |
| 3           | conv   | 3×3×2C      | 1, 1        | 16×16×2C                     | 18C²                     |
|             | relu   |             |             | 16×16×2C                     |                          |
|             | pool   | P×P         | 2, 0        | 8×8×2C                       |                          |
| 4           | conv   | 3×3×4C      | 1, 1        | 8×8×4C                       | 72C²                     |
|             | relu   |             |             | 8×8×4C                       |                          |
|             | pool   | P×P         | 2, 0        | 4×4×4C                       |                          |
| 5           | conv   | 3×3×4C      | 1, 1        | 4×4×4C                       | 144C²                    |
|             | relu   |             |             | 4×4×4C                       |                          |
|             | pool   | P×P         | 2, 0        | 2×2×4C                       |                          |
| 6           | fc     |             |             | 4F                           | 64CF                     |
|             | relu   |             |             | 4F                           |                          |
|             | maxout |             |             | 2F                           |                          |
| 7           | fc     |             |             | 2F                           | 4F²                      |
|             | relu   |             |             | 2F                           |                          |
| 8           | fc     |             |             | F                            | 2F²                      |
|             | relu   |             |             | F                            |                          |
| 9           | fc     |             |             | 128                          | 128F                     |

In previous work on CNN face descriptors, a simple pair-wise comparison of face descriptors between face tracks was applied [20, 21, 25]. This strategy has earlier been shown to be very effective [6, 30], but is rather inefficient due to its O(n²) complexity. Thus, the key concept of our proposed face track comparison method is to keep the effectiveness while significantly increasing the efficiency.

3. Low-resolution CNN

When building a CNN for the LR video scenario, two parts of this process require specific consideration. First, the architecture is somewhat limited in its number of layers due to the low resolution. Second, an appropriate training strategy is required, because the face descriptor for each frame should be as compact as possible to allow efficient track handling.

3.1. Network architecture

Roughly speaking, a conventional CNN consists of two types of layer groups: convolutional and fully connected ones. Putting activation function layers and some further tweaks aside, a convolutional group usually consists of a convolutional layer filtering the input data and a pooling layer condensing its output.

Table 2. Comparison of LrfNet size with related networks.

| network            | #params  |
|--------------------|----------|
| LeNet [16]         | 0.06M    |
| DeepFace [25]      | 120M     |
| VGG-Face [20]      | 145M     |
| FaceNet [21]       | 7.5-140M |
| LrfNet (proposed)  | 5.1-86M  |

Fully connected groups consist only of a fully connected layer which connects all input and output neurons pair-wise, like a conventional Multi-Layer Perceptron (MLP) layer. Stacking up some convolutional and afterwards some fully connected layer groups yields a CNN.

The limiting layer type in this setup is the pooling layer, because each one reduces the incoming data size at least by half by pooling together at least 2×2 neurons of the previous convolutional layer. Including too many pooling layers results in a bottleneck between the convolutional groups and the fully connected ones in the case of LR images. Specifically, all local image information will be lost and condensed into one global feature in the extreme case of enough pooling layers to reduce the layer output size to 1×1. In conclusion, this bottleneck has to be large enough to allow the necessary information to pass. This is also why we cannot adopt recent findings about training very deep networks [11, 23], but instead are limited to conventional architectures.
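To make this pooling arithmetic concrete, the following small Python sketch (illustrative, not from the paper) counts how many halving pooling layers an input resolution tolerates before the feature map collapses to 1×1:

```python
# Each pooling layer roughly halves the spatial size;
# count the halvings until the map collapses to 1x1.
for size in (224, 32):
    steps, s = 0, size
    while s > 1:
        s //= 2
        steps += 1
    print(f"{size}x{size} input collapses to 1x1 after {steps} pooling layers")
# 224x224 tolerates 7 halvings, 32x32 only 5; table 1 therefore stops
# after 4 pooling layers, keeping a 2x2 map before the fc groups.
```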

Considering this fact and some further findings about designing a CNN, such as using rectified linear units (relu) as activation layer [14] or replacing larger filters with 3×3 ones [20], leads to the adapted low-resolution face network (LrfNet) architecture shown in table 1. Its size lies somewhere in between that of the basic LeNet [16] and state-of-the-art HR face networks such as DeepFace [25], VGG-Face [20] or FaceNet [21] (see table 2).

Only the parameters and design choices that are forced by the scenario are fixed in the architecture; all further parameters are variable or limited to feasible changes. This includes the number of filters per convolutional layer (influenced by C), the pooling area size (P×P), the number of fully connected neurons (influenced by F), as well as the exact number of layer groups of each type, e.g. layer groups 4, 5, 7 and 8 are removable. Stride and padding are chosen in a way that as little data as possible is lost in the process. Due to the variable architecture, the number of network parameters varies significantly and lies between 5.1M and 86M for the evaluated settings. A sketch of one possible instantiation follows below.
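The following PyTorch sketch illustrates one way to instantiate the table 1 structure; it is our reading of the architecture, not the authors' original Caffe definition. The defaults follow the combined optimum found later in section 6.1 (grayscale input, convolutional groups 1-4 with group 5 removed, C = 256, P = 3, fully connected groups 6 and 9 with groups 7 and 8 removed, F = 1024):

```python
import torch
import torch.nn as nn

class LrfNet(nn.Module):
    """Sketch of the table 1 architecture (assumed reading, not the authors' code)."""

    def __init__(self, C=256, P=3, F=1024, d=128):
        super().__init__()
        pad = (P - 1) // 2  # keeps the halving behaviour of table 1 for P > 2

        def conv_group(cin, cout, pool=True):
            layers = [nn.Conv2d(cin, cout, 3, stride=1, padding=1), nn.ReLU()]
            if pool:
                layers.append(nn.MaxPool2d(P, stride=2, padding=pad))
            return layers

        self.features = nn.Sequential(      # for a 32x32x1 input:
            *conv_group(1, C, pool=False),  # group 1: 32x32xC
            *conv_group(C, C),              # group 2: 16x16xC
            *conv_group(C, 2 * C),          # group 3: 8x8x2C
            *conv_group(2 * C, 4 * C),      # group 4: 4x4x4C
        )
        self.F = F
        self.fc6 = nn.Linear(4 * 4 * 4 * C, 4 * F)  # group 6: fc + relu + maxout
        self.fc9 = nn.Linear(2 * F, d)              # group 9: 128-d descriptor

    def forward(self, x):
        x = self.features(x).flatten(1)
        x = torch.relu(self.fc6(x))
        # maxout halves 4F to 2F by taking the max over pairs of neurons
        x = x.view(x.size(0), 2 * self.F, 2).max(dim=2).values
        return self.fc9(x)
```

With this sketch, `LrfNet()(torch.randn(8, 1, 32, 32))` yields an (8, 128) batch of face descriptors.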

3.2. Training strategy

Similar to [21], we understand the network as a function that maps the input face image to a target space that is discriminative for face recognition. As the dimension of this embedding target space, we choose a fixed size of d = 128, which has repeatedly proven an appropriate choice for face descriptors [19, 21]. The training follows a Siamese setup where the loss function minimizes the euclidean distance between face descriptors for positive face pairs (same identity) while maximizing the one for negative pairs (different identity). This can be understood as a network that consists of two branches, each one processing one face image of the pair. Each branch has the same network structure previously defined by table 1. For application after training, only one of the two identical branches is kept; it projects a face image into the target descriptor space where comparison to further face descriptors is quickly possible by euclidean distance. As loss function l we suggest a max-margin hinge loss formulation

l = \sum_{i,j} \max\left(0,\; 1 - y_{ij} \cdot \left(b - d^2(x_i, x_j)\right)\right), \qquad (1)

similar to [22], where x_i and x_j denote the face descriptors, y_ij ∈ {−1, 1} the indicator variable, b the decision boundary and d² the squared euclidean distance. The benefit of the max-margin loss function compared to the common contrastive loss [10] for Siamese setups is that it prevents pushing the distance of a correctly classified positive pair towards 0 if classification is easy, i.e. if it lies far from the decision border. This can avoid overfitting effects.
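As an illustration, a PyTorch-style sketch of Eq. (1) on a mini-batch of descriptor pairs; the value of the boundary b is a placeholder assumption, the paper does not state it at this point:

```python
import torch

def max_margin_loss(x_i, x_j, y, b=1.0):
    """Eq. (1): x_i, x_j are (batch, 128) descriptors from the two Siamese
    branches; y holds +1 for same-identity pairs, -1 otherwise.
    b is the decision boundary (placeholder value, an assumption)."""
    d2 = ((x_i - x_j) ** 2).sum(dim=1)  # squared euclidean distance
    return torch.clamp(1.0 - y * (b - d2), min=0.0).sum()
```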

4. Fast track comparison strategy

The trained CNN yields one face descriptor for each frame of a face track, leading to a sequence of face descriptors. The comparison strategy can have a significant impact on both runtime and recognition performance. Thus, we suggest a method based on a manifold assumption of the face descriptors that fulfills both criteria. If we assume that the face manifold of the input image data is sufficiently preserved by the CNN, then it is valid to locally model the manifold in a linear way by k patches. Local patches of the manifold are found by k-means clustering. Instead of modeling the local patches of the manifold by a hyperplane [17] or a single representative [32], each patch is modeled by the cluster mean, as indicated by figure 1. The manifold-manifold distance is computed by the pair-wise minimum distance of the local means. This local mean method (LMM) has several benefits. First, we will show that a small number of local patches is sufficient to achieve competitive performance, which, second, leads to a low comparison effort. Third, the method is tolerant to noise caused by outliers due to the averaging of the face descriptors.
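For illustration, a minimal sketch of the LMM track distance; the use of scikit-learn's k-means is our assumption, the paper only states that k-means clustering is used:

```python
import numpy as np
from sklearn.cluster import KMeans

def lmm_distance(X, Y, k=1):
    """X, Y: (n_frames, 128) arrays of CNN descriptors of two face tracks.
    Model each track by k local cluster means and return the minimum
    pairwise euclidean distance between the means (cf. figure 1)."""
    mx = KMeans(n_clusters=min(k, len(X)), n_init=10).fit(X).cluster_centers_
    my = KMeans(n_clusters=min(k, len(Y)), n_init=10).fit(Y).cluster_centers_
    return np.linalg.norm(mx[:, None, :] - my[None, :, :], axis=2).min()
```

For k = 1 this degenerates to comparing the two track means, which is why matching becomes as cheap as a single best-shot comparison.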

5. Training data

All publicly available datasets that have a sufficient size to train a CNN are single image datasets. Because of this, we create a dataset of TV face tracks (TVC) in addition to the public FaceScrub [18] and MSRA-CFW (MSRA) [31] datasets to represent the target scenario in the training data.

Table 3. Statistics of our combined training dataset and its sub-datasets.

| dataset    | #images   | #samples | #identities |
|------------|-----------|----------|-------------|
| FaceScrub  | 51,162    | 51,162   | 451         |
| MSRA       | 163,018   | 163,018  | 1,372       |
| TVC        | 1,152,545 | 15,427   | 604         |
| combined   | 1,366,725 | 229,607  | 2,427       |

We collect the face tracks from about 80 hours of local TV program with a Viola-Jones [28] based face tracker. False positives are automatically removed by a second plausibility stage looking for skin color and sufficiently visible face parts (at least two out of both eyes, nose and mouth), similar to [26]. The remaining face tracks are labeled with their identity. By using local TV productions, we avoid an identity overlap with any of the celebrities usually found in public face recognition datasets, including FaceScrub, MSRA as well as the test dataset YTF [30]. We took particular care that no identity of the test set is present in the training data. This includes eliminating all YTF identities from FaceScrub and MSRA. We considered a YTF identity to be in any of the other datasets if the celebrity names matched with an edit distance of 1 or less; a minimal sketch of this check follows below.
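For illustration, a small sketch of such a name check (the example names are hypothetical, not from the datasets):

```python
def edit_distance(a: str, b: str) -> int:
    # Classic Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# An identity counts as overlapping if any name pair is this close:
assert edit_distance("Jon Stewart", "John Stewart") <= 1
```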

Our combined training data consists of more than 1M images from about 230K samples of 2.4K identities; the detailed statistics are shown by table 3. A sample denotes a cohesive set of face images, i.e. a face track for video data or a single image for still image data. In addition, the validation set includes the about 300 identities and 50K samples removed from FaceScrub and MSRA.

6. Experiments

Using the datasets described in the previous section, we first perform an architecture optimization of the proposed CNN. Afterwards, the optimized version is compared to state-of-the-art face recognition methods on the test datasets. Evaluation follows the verification methodology with a 10-fold cross-validation, reporting the standard measures for this scenario. The CNNs are trained on a GeForce Titan X using the Caffe framework [13].

6.1. Face image descriptor (CNN)

Because we specifically made an open design allowing certain parameters to change, we optimize these on our validation dataset. Besides optimizing intrinsic parameters such as the number of neurons, we also adjust the number of layer groups of each type. Table 4 documents this process for a face size of 32×32 pixels and a training duration of 10 epochs, where we start from a baseline parameter setting (indicated for each parameter) and optimize each parameter value.

Table 4. Optimization of the proposed LrfNet architecture on the validation set at 32×32 pixels face size. Baseline values are marked with an asterisk. Results are mean accuracy and standard deviation for a 10-fold cross-validation.

| parameter                             | case | value                    | accuracy | std   |
|---------------------------------------|------|--------------------------|----------|-------|
| color                                 | 1    | grayscale*               | 0.767    | 0.009 |
|                                       | 2    | RGB                      | 0.757    | 0.007 |
| #conv. layer groups                   | 3    | 3 - (1,2,3,6,...)        | 0.763    | 0.011 |
|                                       | 4    | 4 - (1,2,3,4,6,...)*     | 0.767    | 0.009 |
|                                       | 5    | 5 - (1,2,3,4,5,6,...)    | 0.759    | 0.011 |
| #filters per conv. layer C            | 6    | 64                       | 0.755    | 0.009 |
|                                       | 7    | 128*                     | 0.767    | 0.009 |
|                                       | 8    | 256                      | 0.781    | 0.009 |
|                                       | 9    | 512                      | 0.779    | 0.010 |
| max-pooling size P                    | 10   | 2*                       | 0.767    | 0.009 |
|                                       | 11   | 3                        | 0.768    | 0.006 |
|                                       | 12   | 4                        | 0.762    | 0.006 |
| #fully connected layers               | 13   | 2 - (...,6,9)            | 0.776    | 0.009 |
|                                       | 14   | 3 - (...,6,7,9)*         | 0.767    | 0.009 |
|                                       | 15   | 4 - (...,6,7,8,9)        | 0.735    | 0.018 |
| #neurons per fully connected layer F  | 16   | 64                       | 0.762    | 0.010 |
|                                       | 17   | 128*                     | 0.767    | 0.009 |
|                                       | 18   | 256                      | 0.781    | 0.007 |
|                                       | 19   | 512                      | 0.783    | 0.008 |
|                                       | 20   | 1024                     | 0.791    | 0.005 |
|                                       | 21   | 2048                     | 0.754    | 0.010 |
| loss function                         | 22   | contrastive              | 0.753    | 0.014 |
|                                       | 23   | max-margin*              | 0.767    | 0.009 |
| combined                              | 24   | cases 1,4,8,11,13,20,23  | 0.802    | 0.007 |

In contradiction to related literature, where this optimization process is often intransparent, if performed at all, we distinctly show the effects of each parameter. This process clearly documents which variations contribute the most to a successful LR CNN. In this sense, the results indicate that a sufficiently large, but not too large, number of filters per convolutional layer (cases 6-9), a big enough pooling area (cases 10-12) and the max-margin loss function (case 23) lead to a significant improvement in performance. Regarding the fully connected layers, performance increases with the number of neurons (cases 16-21) and decreases with the number of fully connected layer groups (cases 13-15). We suspect both effects have the same reason: the number of neurons in the last layer before the output layer, which varies in the proposed architecture when changing the number of fully connected layer groups. Further changes, such as the number of convolutional layer groups (cases 3-5), have only an insignificant effect for the tested range. Quite interesting to note is that including color information (case 2) has a negative influence on performance in this scenario, despite CNNs offering a very elegant way to use it. Using the optimization results, a final LrfNet is trained combining the best parameter choices (case 24).

Table 5. Comparison of the proposed LrfNet face descriptor and the proposed LMM track descriptor with further descriptors on the YTF dataset at 32×32 pixels face size. For details refer to the text.

| track distance        | measure      | LrfNet      | VGG-Face [20] | dense SIFT [22] | LBP [1]     | raw pixel   |
|-----------------------|--------------|-------------|----------------|------------------|-------------|-------------|
| best shot, cosine     | accuracy±std | 0.741±0.018 | 0.691±0.020    | 0.584±0.021      | 0.580±0.022 | 0.551±0.019 |
|                       | AUC / EER    | 0.821 / 0.259 | 0.751 / 0.311 | 0.604 / 0.436   | 0.615 / 0.423 | 0.572 / 0.448 |
| best shot, euclidean  | accuracy±std | 0.752±0.018 | 0.613±0.032    | 0.584±0.021      | 0.566±0.023 | 0.552±0.021 |
|                       | AUC / EER    | 0.839 / 0.246 | 0.661 / 0.382 | 0.604 / 0.436   | 0.594 / 0.434 | 0.565 / 0.455 |
| best shot, Hellinger  | accuracy±std | 0.733±0.015 | 0.645±0.022    | 0.584±0.016      | 0.596±0.021 | 0.547±0.022 |
|                       | AUC / EER    | 0.821 / 0.265 | 0.709 / 0.350 | 0.611 / 0.429   | 0.623 / 0.420 | 0.565 / 0.456 |
| minset, cosine        | accuracy±std | 0.778±0.017 | 0.723±0.023    | 0.655±0.010      | 0.635±0.014 | 0.600±0.024 |
|                       | AUC / EER    | 0.868 / 0.218 | 0.799 / 0.273 | 0.701 / 0.358   | 0.681 / 0.372 | 0.631 / 0.408 |
| minset, euclidean     | accuracy±std | 0.792±0.015 | 0.641±0.028    | 0.654±0.010      | 0.629±0.023 | 0.570±0.021 |
|                       | AUC / EER    | 0.883 / 0.206 | 0.697 / 0.360 | 0.701 / 0.358   | 0.673 / 0.390 | 0.599 / 0.439 |
| minset, Hellinger     | accuracy±std | 0.788±0.019 | 0.670±0.024    | 0.656±0.007      | 0.652±0.017 | 0.573±0.015 |
|                       | AUC / EER    | 0.871 / 0.216 | 0.744 / 0.323 | 0.708 / 0.353   | 0.701 / 0.357 | 0.604 / 0.434 |
| MSM, cosine           | accuracy±std | 0.796±0.014 | 0.752±0.017    | 0.628±0.017      | 0.624±0.016 | 0.583±0.028 |
|                       | AUC / EER    | 0.886 / 0.201 | 0.827 / 0.247 | 0.675 / 0.383   | 0.666 / 0.387 | 0.608 / 0.430 |
| MSM, euclidean        | accuracy±std | 0.796±0.016 | 0.756±0.020    | 0.628±0.017      | 0.624±0.014 | 0.585±0.025 |
|                       | AUC / EER    | 0.886 / 0.201 | 0.833 / 0.240 | 0.675 / 0.383   | 0.665 / 0.388 | 0.609 / 0.427 |
| MSM, Hellinger        | accuracy±std | 0.781±0.024 | 0.746±0.020    | 0.622±0.011      | 0.632±0.021 | 0.576±0.028 |
|                       | AUC / EER    | 0.869 / 0.219 | 0.826 / 0.250 | 0.670 / 0.391   | 0.672 / 0.383 | 0.599 / 0.437 |
| LMM1, cosine          | accuracy±std | 0.796±0.014 | 0.758±0.018    | 0.626±0.016      | 0.624±0.014 | 0.583±0.029 |
|                       | AUC / EER    | 0.886 / 0.201 | 0.831 / 0.240 | 0.668 / 0.385   | 0.670 / 0.386 | 0.608 / 0.429 |
| LMM1, euclidean       | accuracy±std | 0.803±0.016 | 0.686±0.019    | 0.617±0.014      | 0.609±0.015 | 0.555±0.019 |
|                       | AUC / EER    | 0.894 / 0.192 | 0.751 / 0.316 | 0.663 / 0.395   | 0.647 / 0.401 | 0.585 / 0.441 |
| LMM1, Hellinger       | accuracy±std | 0.797±0.018 | 0.725±0.014    | 0.617±0.013      | 0.626±0.016 | 0.555±0.022 |
|                       | AUC / EER    | 0.880 / 0.200 | 0.798 / 0.270 | 0.670 / 0.386   | 0.669 / 0.387 | 0.587 / 0.450 |
| LMM10, cosine         | accuracy±std | 0.789±0.018 | 0.741±0.017    | 0.643±0.013      | 0.640±0.016 | 0.598±0.025 |
|                       | AUC / EER    | 0.877 / 0.212 | 0.814 / 0.255 | 0.689 / 0.366   | 0.680 / 0.381 | 0.628 / 0.412 |
| LMM10, euclidean      | accuracy±std | 0.803±0.015 | 0.655±0.024    | 0.646±0.011      | 0.632±0.025 | 0.567±0.016 |
|                       | AUC / EER    | 0.889 / 0.196 | 0.719 / 0.344 | 0.690 / 0.370   | 0.667 / 0.385 | 0.596 / 0.436 |
| LMM10, Hellinger      | accuracy±std | 0.785±0.022 | 0.697±0.026    | 0.650±0.009      | 0.640±0.016 | 0.571±0.012 |
|                       | AUC / EER    | 0.875 / 0.214 | 0.766 / 0.306 | 0.696 / 0.364   | 0.684 / 0.368 | 0.601 / 0.434 |

Performing the optimization for different face sizes shows the resolution dependence in figure 2.

6.2. Face track comparison

The optimized LrfNet from the previous section projects a face image to a discriminative face descriptor. When addressing video data, each face track consisting of several face images leads to a set of face descriptors, requiring a track comparison strategy. To evaluate the proposed system, we use the YouTube Faces Database following the official evaluation protocol [30]. The only difference is that we evaluate a face size of 32×32 pixels instead of the original resolution, which is about 100 pixels face width. Specifically, we want to note that for all tested methods in table 5, no fine-tuning is performed on YTF data, resembling an actual application scenario.

Descriptors. We compare our LrfNet face descriptor with a state-of-the-art HR CNN face descriptor, namely the VGG-Face descriptor [20], as well as the LBP [1] and dense SIFT [22] face descriptors, and raw vectorized pixel data.

Distance measures. We evaluate two set-based track distances D, the minset (nearest neighbor) distance

D_m(X, Y) = \min_{x \in X,\, y \in Y} d(x, y), \qquad (2)

and the best-shot distance

D_b(X, Y) = d(x_b, y_b), \qquad (3)

where d denotes a vector distance, X, Y the sets of face descriptors x, y from two face tracks, and x_b, y_b the best face descriptors of the respective face tracks X, Y in terms of most frontal head pose. MSM [9] as a space-based method is, together with the set-based ones, compared to our proposed manifold-based LMM distance

D_l(X, Y) = \min_{i,j = 1,\ldots,k} d(\bar{x}_i, \bar{y}_j), \qquad (4)

where \bar{x}_i and \bar{y}_j denote the k local cluster means of the two tracks (section 4).

Results. The results in table 5 show that an adapted CNN is important for good LR performance, because applying the HR VGG-Face network [20] leads to significantly worse results. Another aspect when considering the practicality of video face recognition is the choice of the face track distance D. While the minset distance D_m shows the best results for the conventional face descriptors (raw pixel, LBP, SIFT), in accordance with previous findings [6, 30], this is different for the CNN face descriptors (LrfNet, VGG-Face). In the context of CNNs, the proposed LMM distance D_l with k = 1 is the favorable choice, because matching face tracks is as fast as with the best-shot method, while also resulting in superior performance compared to the minset distance D_m for the CNN face descriptors. An increased k up to k ≈ 20 has a positive effect for the conventional face descriptors, but a negligible one for the CNN descriptors (see figure 3). If the patch size k is chosen too large, performance decreases. All in all, this means that our proposed system achieves superior performance with a face track descriptor of only 128 dimensions, where comparison can be efficiently performed by euclidean distance.

6.3. Surveillance data

Evaluation for an application scenario is performed with an in-the-wild surveillance dataset recorded on three different days, consisting of 168 face tracks of 71 people. Face sizes are mostly in the range of 20 to 40 pixels and the track length varies from 5 to about 2,300 frames; some examples are included in figure 1. Following the verification setup again and evaluating the most promising combinations from table 5 confirms the superiority of our proposed LrfNet in table 6.

7. Conclusion

Proposing a Convolutional Neural Network for the low-resolution video face recognition domain achieved a verification accuracy of 80.3 percent on a low-resolution version of the YTF dataset, beating high-resolution CNN and conventional face descriptors. This was achieved by using an adjusted architecture and a max-margin based loss function to train the network with a combination of public and self-collected training data. It was shown that CNN face descriptors require a noise-resistant track comparison strategy for full exploitation of their potential, which is quite contrary to the findings for previous face descriptors. Altogether, we showed that the performance leap brought by deep neural networks is transferable to low-resolution face recognition and allows building a compact face track descriptor.

References

[1] T. Ahonen, A. Hadid, and M. Pietikäinen. Face Description with Local Binary Patterns: Application to Face Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):2037–2041, 2006.

[2] O. Arandjelović and R. Cipolla. A pose-wise linear illumination manifold model for face recognition using video. Computer Vision and Image Understanding, 113(1):113–125, 2009.

[3] R. Arandjelović and A. Zisserman. Three things everyone should know to improve object retrieval. In Computer Vision and Pattern Recognition, pages 2911–2918, 2012.

[4] P. Belhumeur, J. Hespanha, and D. Kriegman. Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711–720, 1997.

[5] H. Cevikalp and B. Triggs. Face recognition based on image sets. In Computer Vision and Pattern Recognition, 2010.

[6] S. Chen, S. Mau, M. T. Harandi, C. Sanderson, A. Bigdeli, and B. C. Lovell. Face Recognition from Still Images to Video Sequences: A Local-feature-based Framework. EURASIP Journal on Image and Video Processing, 2011.

[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[8] S. Duffner. Face image analysis with convolutional neural networks. PhD thesis, University of Freiburg, Germany, 2008.

[9] K. Fukui and O. Yamaguchi. Face Recognition Using Multi-viewpoint Patterns for Robot Vision. Robotics Research, pages 192–201, 2005.

[10] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In Computer Vision and Pattern Recognition, volume 2, pages 1735–1742. IEEE, 2006.

[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. arXiv preprint arXiv:1512.03385, 2015.

[12] C. Herrmann and J. Beyerer. Face Retrieval on Large-Scale Video Data. In Computer and Robot Vision, pages 192–199, 2015.

[13] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093, 2014.

[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[15] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back. Face recognition: A convolutional neural-network approach. IEEE Transactions on Neural Networks, 8(1):98–113, 1997.

[16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[17] K. Lee, J. Ho, M. Yang, and D. Kriegman. Video-Based Face Recognition Using Probabilistic Appearance Manifolds. Computer Vision and Pattern Recognition, 1:313–320, 2003.

[18] H.-W. Ng and S. Winkler. A data-driven approach to cleaning large face datasets. In International Conference on Image Processing, pages 343–347. IEEE, 2014.

[19] O. M. Parkhi, K. Simonyan, A. Vedaldi, and A. Zisserman. A Compact and Discriminative Face Track Descriptor. In Computer Vision and Pattern Recognition, 2014.

[20] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. British Machine Vision Conference, 1(3):6, 2015.

[21] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. In Computer Vision and Pattern Recognition, pages 815–823, 2015.

[22] K. Simonyan, O. M. Parkhi, A. Vedaldi, and A. Zisserman. Fisher vector faces in the wild. In British Machine Vision Conference, volume 1, page 7, 2013.

[23] R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. In Advances in Neural Information Processing Systems, pages 2368–2376, 2015.

[24] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. In Computer Vision and Pattern Recognition, pages 2892–2900, 2015.

[25] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In Computer Vision and Pattern Recognition, pages 1701–1708, 2014.

[26] M. Tapaswi, C. C. Corez, M. Bäuml, H. K. Ekenel, and R. Stiefelhagen. Cleaning up after a face tracker: False positive removal. In International Conference on Image Processing, pages 253–257. IEEE, 2014.

[27] M. Turk and A. Pentland. Eigenfaces for Recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991.

[28] P. Viola and M. J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.

[29] M. E. Wibowo, D. Tjondronegoro, L. Zhang, and I. Himawan. Heteroscedastic probabilistic linear discriminant analysis for manifold learning in video-based face recognition. In Workshop on Applications of Computer Vision, pages 46–52, 2013.

[30] L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In Computer Vision and Pattern Recognition, 2011.

[31] X. Zhang, L. Zhang, X.-J. Wang, and H.-Y. Shum. Finding celebrities in billions of web images. IEEE Transactions on Multimedia, 14(4):995–1007, 2012.

[32] M. Zhao, J. Yagnik, H. Adam, and D. Ross. Large Scale Learning and Recognition of Faces in Web Videos. In Automatic Face and Gesture Recognition, pages 1–7, 2008.
