Person Identification in Webcam Images:
An Application of Semi-Supervised Learning
Maria-Florina Balcan NINAMF@CS.CMU.EDU
Avrim Blum AVRIM@CS.CMU.EDU
Patrick Pakyan Choi PAKYAN@CS.CMU.EDU
John Lafferty LAFFERTY@CS.CMU.EDU
Brian Pantano BPANTANO@ANDREW.CMU.EDU
Mugizi Robert Rwebangira RWEBA@CS.CMU.EDU
Xiaojin Zhu ZHUXJ@CS.CMU.EDU
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA
Abstract
An application of semi-supervised learning is made to the problem of person identification in low quality webcam images. Using a set of images of ten people collected over a period of four months, the person identification task is posed as a graph-based semi-supervised learning problem, where only a few training images are labeled. The importance of domain knowledge in graph construction is discussed, and experiments are presented that clearly show the advantage of semi-supervised learning over standard supervised learning. The data used in the study is available to the research community to encourage further investigation of this problem.
1. Introduction
The School of Computer Science at Carnegie Mellon University has a public lounge, where leftover pizza and other food items from various meetings converge, to the delight of students, staff, and faculty. To help monitor the presence of food in the lounge, a webcam, sometimes called the FreeFoodCam¹, is mounted in a coke machine and trained upon the table where food is placed. After being spotted on the webcam, the arrival of (almost) fresh free food is heralded with instant messages sent throughout the School.

The FreeFoodCam offers interesting opportunities for research in semi-supervised machine learning. This paper presents an investigation of the problem of person identification in this low quality video data, using webcam images of ten people that were collected over a period of several months. The results highlight the importance of domain knowledge in semi-supervised learning, and clearly demonstrate the advantages of using both labeled and unlabeled data over standard supervised learning.

¹ u.edu/˜coke, Carnegie Mellon University internal

Appearing in Proc. of the 22nd ICML Workshop on Learning with Partially Classified Training Data, Bonn, Germany, 2005. Copyright 2005 by the author(s)/owner(s).
In recent years, there has been a substantial amount of work exploring how best to incorporate unlabeled data into supervised learning (Zhu, 2005). Several semi-supervised learning approaches have been proposed for practical applications in different areas, such as information retrieval, text classification (Nigam et al., 1998), and bioinformatics (Weston et al., 2004; Shin et al., 2004). In the context of computer vision, several interesting results have been obtained for object detection. Levin et al. (2003) introduced a technique based on co-training (Blum & Mitchell, 1998) for fitting visual detectors in a way that requires only a small quantity of labeled data, using unlabeled data to improve performance over time. Rosenberg et al. (2005) present a semi-supervised approach to training object detection systems based on self-training, and perform extensive experiments with a state-of-the-art detector (Schneiderman & Kanade, 2002; Schneiderman, 2004a; Schneiderman, 2004b), demonstrating that a model trained in this manner can achieve results comparable to a model trained in the traditional manner using a much larger set of fully labeled data.
In this work, we describe a new application of semi-supervised learning to the problem of person identification in webcam images, where the video stream has a low frame rate, and the images are of low quality. Significantly, many of the images may have no face, as the person could be facing away from the camera. We discuss the creation of the dataset, and the formulation of the semi-supervised learning problem. The task of face recognition, of course, has an extensive literature; see (Zhao et al., 2003) for a survey. However, to the best of our knowledge, person identification in video data has not been previously attacked using semi-supervised learning methods. Relatively primitive image processing techniques are used in our work; we note that more sophisticated computer vision techniques can be easily incorporated into the framework, and should only improve the performance. But the spirit of our contribution is to argue that semi-supervised learning methods may be attractive as a complementary tool to advanced image processing. The data we have developed and that forms the basis for the experiments reported here will be made available to the research community.²

Figure 1. Four typical FreeFoodCam images.
2. The FreeFoodCam Dataset
The dataset consists of 5254 images, each containing one and only one person. Figure 1 shows four typical images from the data. The task is not trivial:
• The images of each person were captured on multiple days during a four month period. People changed clothes, hair styles, and one person even grew a beard. We simulate a video surveillance scenario where images for a group of people are manually labeled in a few beginning frames, and the people must be recognized on later days. Therefore we choose labeled data within the first day of a person's appearance, and test on the remaining images of that day and all other days. This is much more difficult than testing only on the same day, or allowing labeled data to come from all days.
• The FreeFoodCam is a low quality webcam. Each frame has 640×480 resolution, so faces of far away people are small. The frame rate is a little over 0.5 frames per second, and lighting in the lounge is complex and changing.
• A person could turn their face away from the camera, and roughly one third of the images contain no face at all.

Since only a few images are labeled, and all of the test images are available, the task is a natural candidate for the application of semi-supervised learning techniques.

² Instructions for obtaining the dataset can be found at u.edu/˜zhuxj/freefoodcam .
subject  10/24  11/13   1/6  1/14  1/20  1/21  1/27  total
1          128      -     -   193     -     -   153    474
2          256      -     -     -   193     -     -    448
3          288      -     -   305     -     -     -    593
4          204      -     -     -     -   190     -    394
5          266     41     -   189     -    19     -    515
6          195     34   179     -     -     -   104    512
7          126    163   200   180    70    22    28    789
8          189     66   172   117     -    15     -    559
9          189     94   215    69     -    30    43    640
10           -      -    65   143   122     -     -    330
total     1841    398   831  1196   384   276   328   5254

Figure 2. Left: mean background image used for background subtraction. Right: breakdown of the 10 subjects by date.
2.1. Data Collection
We asked ten volunteers to appear in seven FreeFoodCam takes over four months. Not all participants could show up for every take. The FreeFoodCam is located in the Computer Science lounge, but we received a live camera feed in our office, and took images from the camera whenever a new frame was available.

In each take, the participants took turns entering the scene, walking around, and "acting naturally," for example by reading the newspaper or chatting with off-camera colleagues, for five to ten minutes per take. As a result, we collected images where the individuals have varying poses and are at a range of distances from the camera. We discarded all frames that were corrupted by electronic noise in the coke machine, or that contained more than one person in the scene. This latter constraint was imposed to make the task simple to specify as a first step; there is no reason that the methods we present below could not be extended to work with scenes containing multiple people.
2.2. Foreground Color Extraction
To accurately capture the color information of an individual in the image, based primarily on their clothing, we had to separate him or her from the background. As computer vision is not the focus of the work, we used only primitive image processing methods.

A simple background subtraction algorithm was used to find the foreground. We computed the per-pixel means and variances of the red, green and blue channels from 294 background images. Figure 2 shows the mean background. Using the means and variances of the background, we obtained the foreground area in each image by thresholding. Pixels deviating more than three standard deviations from the mean were treated as foreground.
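The thresholding rule can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code used in the study: the function names, the small floor `eps` on the standard deviation, and the choice to flag a pixel when any one of its three channels exceeds the threshold are assumptions, since the text does not say how the channels are combined.

```python
import numpy as np

def fit_background(frames):
    """Per-pixel, per-channel mean and standard deviation of background frames."""
    stack = np.asarray(frames, dtype=np.float64)  # shape (N, H, W, 3)
    return stack.mean(axis=0), stack.std(axis=0)

def foreground_mask(image, mean, std, k=3.0, eps=1e-6):
    """Boolean (H, W) mask of pixels more than k standard deviations from the mean."""
    deviation = np.abs(np.asarray(image, dtype=np.float64) - mean)
    # Assumption: one deviating channel is enough to call the pixel foreground.
    return (deviation > k * np.maximum(std, eps)).any(axis=-1)
```

With the 294 background images stacked into one array, `fit_background` would be called once and `foreground_mask` once per frame.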
To improve the quality of the foreground color histogram, we processed the foreground area using morphological transforms (Jain, 1989). Further processing was required because the foreground derived from background subtraction often captured only part of the body and contained background areas. We first removed small islands in the foreground by applying the open operation with a 7-pixel-wide square. We then connected vertically separated pixel blocks (such as head and lower torso) using the close operation with a 60-pixel-by-10-pixel rectangular block. Finally, we made sure the foreground contains the entire person by enlarging the foreground to include neighboring pixels, further closing the foreground with a disk of 20 pixels in radius. And because there is only one person in each image, we discarded all but the largest contiguous block of pixels in the processed foreground. Figure 3 shows some processed foreground images.
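The sequence of morphological steps might look as follows with `scipy.ndimage`. This is a sketch under stated assumptions, not the original implementation: the helper names are hypothetical, the 60-by-10 block is taken to be 60 rows tall and 10 columns wide (which suits bridging vertical gaps), and the default connectivity of `ndimage.label` is used for "contiguous".

```python
import numpy as np
from scipy import ndimage

def disk(radius):
    """Boolean disk-shaped structuring element of the given radius."""
    yy, xx = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    return yy ** 2 + xx ** 2 <= radius ** 2

def clean_foreground(mask, open_size=7, bridge_shape=(60, 10), disk_radius=20):
    """Open, close twice, then keep the largest connected component."""
    mask = np.asarray(mask, dtype=bool)
    # 1. Remove small islands with a square opening.
    mask = ndimage.binary_opening(mask, structure=np.ones((open_size, open_size), bool))
    # 2. Bridge vertically separated blocks (assumed: 60 rows tall, 10 columns wide).
    mask = ndimage.binary_closing(mask, structure=np.ones(bridge_shape, bool))
    # 3. Close again with a disk.
    mask = ndimage.binary_closing(mask, structure=disk(disk_radius))
    # 4. Keep only the largest contiguous block.
    labels, count = ndimage.label(mask)
    if count == 0:
        return mask
    sizes = np.bincount(labels.ravel())
    sizes[0] = 0  # label 0 is background
    return labels == sizes.argmax()
```

Note that `scipy.ndimage` treats pixels outside the image as background, so a foreground touching the frame border may be eroded slightly by the closing steps.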
After this processing the foreground area is represented by a 100-dimensional vector, which consists of a 50-bin hue histogram, a 30-bin saturation histogram, and a 20-bin brightness histogram.
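As an illustration, the 100-dimensional feature could be computed as below. The sketch assumes an 8-bit RGB image and a boolean foreground mask, uses the standard-library `colorsys` conversion (slow, but dependency-free), and returns raw bin counts; whether the original histograms were normalised is not stated, so no normalisation is applied here.

```python
import colorsys
import numpy as np

def hsv_histogram(image, mask, bins=(50, 30, 20)):
    """Concatenate hue, saturation and brightness histograms of masked pixels."""
    pixels = np.asarray(image, dtype=np.float64)[np.asarray(mask, dtype=bool)] / 255.0
    # colorsys works on one pixel at a time and returns h, s, v in [0, 1].
    hsv = np.array([colorsys.rgb_to_hsv(r, g, b) for r, g, b in pixels]).reshape(-1, 3)
    parts = [np.histogram(hsv[:, ch], bins=n, range=(0.0, 1.0))[0]
             for ch, n in enumerate(bins)]
    return np.concatenate(parts).astype(np.float64)
```

Because the color edges described later compare histograms by cosine similarity, a common rescaling of the whole vector would not change which images are nearest neighbors.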
2.3. Face Image Extraction
The face of the person is stored as a small image, which is derived from the outputs of a face detector (Schneiderman, 2004a; 2004b). Note that this is not a face recognizer (a face recognizer was not used for this task). It simply detects the presence of frontal or profile faces, and outputs the estimated center and radius of the detected face. We took a square area around the center as the face image. If no face was detected, the face image is empty. Figure 4 shows a few face images as determined by the face detector.
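A minimal sketch of the cropping step, assuming the detector output is available as a hypothetical `(cx, cy, radius)` tuple in pixel coordinates, or `None` when no face is found. The side length of the square (here `2 * radius + 1`, clipped at the image border) is an assumption; the text does not give it.

```python
import numpy as np

def crop_face(image, detection):
    """Return the square patch around a detected face, or None if there is no face."""
    if detection is None:
        return None
    cx, cy, radius = detection  # hypothetical detector output format
    height, width = image.shape[:2]
    top, bottom = max(cy - radius, 0), min(cy + radius + 1, height)
    left, right = max(cx - radius, 0), min(cx + radius + 1, width)
    return image[top:bottom, left:right]
```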
2.4. Summary of the Dataset
In summary, the dataset is comprised of 5254 images of ten individuals, collected during seven takes over four months. There is a slight imbalance in the class distribution, and only a subset of individuals are present on each day (refer to Figure 2 for the breakdown). Overall, 34% of the images (1808 out of 5254) do not contain a face.

Figure 3. Examples of foregrounds extracted by background subtraction and morphological transforms.

Figure 4. Examples of face images detected by the face detector.

Each image in the dataset is represented by three features:
Time: The date and time the image was taken.
Color histogram of processed foreground: A 100-dimensional vector consisting of three histograms of the foreground pixels: a 50-bin hue histogram, a 30-bin saturation histogram, and a 20-bin brightness histogram.
Face image: A square color image of the face (if present). As mentioned above, this feature is missing in about 34% of the images.

3. The Graphs
Graph-based semi-supervised learning depends critically on the construction and quality of the graph. The graph should reflect domain knowledge through the similarity function that is used to assign edges (and their weights). For the FreeFoodCam data the nodes in the graph are the images. An edge is formed between two images according to the following criteria:
1. Time edges. People normally move around in the lounge at moderate speed, thus adjacent frames are likely to contain the same person. We represent this knowledge in the graph by putting an edge between two images if their time difference is less than a threshold $t_1$ (usually a few seconds).

2. Color edges. The color histogram is largely determined by a person's apparel. We assume people change clothes on different days, so that the color histogram tends to be unusable across multiple days. However, it is an informative feature during a shorter time period ($t_2$), such as half a day. In the graph, for every image $i$, we find the set of images having a time difference between $t_1$ and $t_2$ to $i$, and connect $i$ with its $k_c$ nearest neighbors (in terms of cosine similarity on histograms) in the set. The parameter $k_c$ is a small integer, such as three.

3. Face edges. We use face similarity over longer time spans. For every image $i$ with a face, we find the set of images more than $t_2$ apart from $i$, and connect $i$ with its $k_f$ nearest neighbors in the set. We use pixel-wise Euclidean distance between face images, where the pair of face images is scaled to the same size.

Figure 5. A random image and its neighbors in the graph. Panels: image 2910; neighbor 1 (time edge); neighbors 2, 3 and 4 (color edges); neighbor 5 (face edge).
The final graph is the union of the three kinds of edges. The edges are unweighted. We used $t_1 = 2$ seconds, $t_2 = 12$ hours, $k_c = 3$ and $k_f = 1$ below. Conveniently, these parameters result in a connected graph.
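The three edge rules can be combined into a single adjacency matrix roughly as follows. This is a sketch, not the authors' code. It assumes time stamps in seconds, face images already rescaled to a common size and flattened to vectors, and rows of NaN marking images without a face. Symmetrising the nearest-neighbor edges is also an assumption, made so that the resulting graph is undirected.

```python
import numpy as np

def build_graph(times, hists, faces, t1=2.0, t2=12 * 3600.0, kc=3, kf=1):
    """Unweighted, symmetric adjacency matrix from time, color and face edges."""
    n = len(times)
    dt = np.abs(times[:, None] - times[None, :])
    W = (dt < t1).astype(float)            # time edges
    np.fill_diagonal(W, 0.0)
    # Cosine similarity between color histograms.
    unit = hists / np.maximum(np.linalg.norm(hists, axis=1, keepdims=True), 1e-12)
    cos = unit @ unit.T
    has_face = ~np.isnan(faces).any(axis=1)
    for i in range(n):
        # Color edges: kc most similar images with t1 < time gap < t2.
        cand = np.flatnonzero((dt[i] > t1) & (dt[i] < t2))
        for j in cand[np.argsort(-cos[i, cand])[:kc]]:
            W[i, j] = W[j, i] = 1.0
        # Face edges: kf closest faces more than t2 away in time.
        if has_face[i]:
            cand = np.flatnonzero((dt[i] > t2) & has_face)
            dist = np.linalg.norm(faces[cand] - faces[i], axis=1)
            for j in cand[np.argsort(dist)[:kf]]:
                W[i, j] = W[j, i] = 1.0
    return W
```

The dense `n × n` arrays keep the sketch short; for the full set of 5254 images a sparse representation would be the more economical choice.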
It is impractical to visualize the whole graph. Instead, we show the neighbors of a random node in Figure 5.
4. Algorithms
We use the simple Gaussian field and harmonic function algorithm (Zhu et al., 2003) on the FreeFoodCam dataset. Let $l$ be the number of labeled images, $u$ the number of unlabeled images, and $n = l + u$. The graph is represented by the $n \times n$ weight matrix $W$. Let $D$ be the diagonal degree matrix with $D_{ii} = \sum_j W_{ij}$, and define the combinatorial Laplacian

$$L = D - W \tag{1}$$

Let $Y_l$ be an $l \times C$ label matrix, where $C = 10$ is the number of classes. For $i \le l$, $Y_l(i, c) = 1$ if labeled image $i$ is in class $c$, and $Y_l(i, c) = 0$ otherwise. Then the harmonic function solution for the unlabeled data is

$$Y_u = -L_{uu}^{-1} L_{ul} Y_l \tag{2}$$

where $L_{uu}$ is the submatrix of $L$ on unlabeled nodes and so on. Each row of $Y_u$ can be interpreted as the collection of posterior probabilities $p(y_i = c \mid Y_l)$ for $c = 1, \ldots, C$ and $i \in U$. Classification is carried out by finding the class with the maximal posterior in each row.
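Equations (1) and (2) translate almost directly into NumPy. The sketch below is illustrative only: the function name and argument layout are assumptions, and it solves the linear system instead of forming the inverse explicitly, which gives the same result and is numerically preferable. It also assumes every unlabeled node is connected to some labeled node, so that $L_{uu}$ is invertible.

```python
import numpy as np

def harmonic_solution(W, labeled_idx, labels, n_classes):
    """Return the unlabeled node indices and their harmonic scores Y_u."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W                      # equation (1)
    labeled_idx = np.asarray(labeled_idx)
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
    Y_l = np.eye(n_classes)[np.asarray(labels)]         # one-hot label matrix
    L_uu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    L_ul = L[np.ix_(unlabeled_idx, labeled_idx)]
    Y_u = -np.linalg.solve(L_uu, L_ul @ Y_l)            # equation (2)
    return unlabeled_idx, Y_u
```

On a four-node chain whose two end nodes carry different labels, the two interior nodes receive scores of (2/3, 1/3) and (1/3, 2/3), which matches the averaging interpretation of the harmonic function.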
In (Zhu et al., 2003) it has also been shown that incorporating class proportion knowledge can be helpful. The proportion $q_c$ of data with label $c$ can be estimated from the labeled set. In particular, the class mass normalization (CMN) heuristic scales the posteriors to meet the proportions. That is, one finds a set of coefficients $a_1, \ldots, a_C$ such that

$$a_1 \sum_{i \in U} Y_u(i, 1) : \cdots : a_C \sum_{i \in U} Y_u(i, C) = q_1 : \cdots : q_C \tag{3}$$
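Equation (3) fixes the coefficients only up to a common scale, so one convenient choice is $a_c = q_c / \sum_{i \in U} Y_u(i, c)$. The sketch below uses that choice and then labels each unlabeled point by the largest rescaled score; the function name and the small guard against an empty class mass are assumptions.

```python
import numpy as np

def cmn_predict(Y_u, q):
    """Class mass normalization: rescale columns to match q, then take the argmax."""
    Y_u = np.asarray(Y_u, dtype=float)
    mass = Y_u.sum(axis=0)                               # total score mass per class
    a = np.asarray(q, dtype=float) / np.maximum(mass, 1e-12)
    return np.argmax(Y_u * a, axis=1)
```

For example, with scores `[[0.6, 0.4], [0.55, 0.45], [0.9, 0.1]]` and equal proportions `q = [0.5, 0.5]`, the plain argmax assigns every point to the first class, whereas the rescaled scores move the first two points to the second class.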
Classification of an unlabeled point $i$ is achieved by finding $\arg\max_c a_c Y_u(i, c)$. In the experiments below we report the accuracy of both the harmonic function and CMN.

Figure 6. An example "gradient walk" on the graph. The walk starts from an unlabeled image, passes through assorted edges (face, time and color edges in this example), and ends at a labeled image.

4.1. Gradient Walks on the Graph
The harmonic algorithm described above solves a set of linear equations so that the predicted label of each example is the average of the predicted labels of its unlabeled neighbors and the actual labels of its labeled neighbors. The "reasons" for the algorithm's predictions can (roughly) be visualized by performing a "gradient walk" starting from an unlabeled example $i$, always moving to the neighbor with the highest score given to the predicted label. That is, let $y$ be the predicted label for $i$. If we are at node $j$, we will walk to $j$'s neighbor node $k$ if

$$k = \arg\max_{k' \sim j} Y_u(k', y) \tag{4}$$

The gradient walk continues until we reach a labeled example. Two gradient walk paths are shown in Figure 6 and Figure 7.
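A gradient walk following equation (4) can be sketched as below. The sketch assumes a score matrix with one row per node, where labeled rows hold the one-hot labels and unlabeled rows hold the harmonic scores. Skipping already visited nodes is an added safeguard against cycles when scores tie; it is not part of the description above.

```python
import numpy as np

def gradient_walk(W, scores, is_labeled, start):
    """Follow the highest-scoring neighbor until a labeled node is reached."""
    y = int(np.argmax(scores[start]))       # predicted label of the start node
    path, visited, j = [start], {start}, start
    while not is_labeled[j]:
        # Assumption: skip visited nodes so that ties cannot cause a cycle.
        neighbors = [int(k) for k in np.flatnonzero(W[j]) if int(k) not in visited]
        if not neighbors:
            break
        j = max(neighbors, key=lambda k: scores[k, y])   # equation (4)
        path.append(j)
        visited.add(j)
    return y, path
```

The returned path lists the nodes visited in order, so the edge types along it (time, color, or face) can be looked up afterwards to produce a figure like Figure 6.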
5. Experimental Results
We evaluated harmonic functions on the FreeFoodCam tasks. For each task we systematically increased the labeled set size, and performed 30 random trials for each labeled set size. In each trial we randomly sampled a labeled set with the specified size from the first day of a person's appearance only. This is because we wanted to simulate a video surveillance scenario, where people are tagged and identified on later days. It is more difficult and more realistic than sampling labeled data from the entire dataset. If a class was missing from the sampled labeled set, we redid the random sampling. The remaining images are used as the unlabeled set.

We report the classification accuracies with harmonic functions and CMN, on two different graphs. The first graph is constructed with parameters $t_1 = 2$ seconds, $t_2 = 12$ hours, $k_c = 3$, $k_f = 1$; the second with $k_c = 1$. The results are presented in Figure 8.

To compare the graph-based semi-supervised learning methods against a standard supervised learning method, we used a Matlab implementation of support vector machines (Gunn, 1997) as the baseline. For $C$-class multiclass problems, we used a one-against-all scheme which creates $C$ binary subproblems, one for each class against all the other classes, and select the class with the largest margin. Because we have missing features on face sub-images, the kernel for the SVM baseline requires special care. We used an interpolated linear kernel $K(i, j) = w_t K_t(i, j) + w_c K_c(i, j) + w_f K_f(i, j)$, where $K_t$, $K_c$, $K_f$ are linear kernels (inner products) on time stamp, color histogram, and face sub-image (normalized to 50×50 pixels) respectively. If image $i$ contains no face, we define $K_f(i, \cdot) = 0$. The interpolation weights $w_t$, $w_c$, $w_f$ were optimized with cross validation. Notice the SVMs with such a kernel are not semi-supervised: the unlabeled data are merely used as test data. We found that the harmonic