Person identification in webcam images: An application of semi-supervised learning. ICML 200


Person Identification in Webcam Images: An Application of Semi-Supervised Learning

Maria-Florina Balcan NINAMF@CS.CMU.EDU
Avrim Blum AVRIM@CS.CMU.EDU
Patrick Pakyan Choi PAKYAN@CS.CMU.EDU
John Lafferty LAFFERTY@CS.CMU.EDU
Brian Pantano BPANTANO@ANDREW.CMU.EDU
Mugizi Robert Rwebangira RWEBA@CS.CMU.EDU
Xiaojin Zhu ZHUXJ@CS.CMU.EDU
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA

Abstract

An application of semi-supervised learning is made to the problem of person identification in low quality webcam images. Using a set of images of ten people collected over a period of four months, the person identification task is posed as a graph-based semi-supervised learning problem, where only a few training images are labeled. The importance of domain knowledge in graph construction is discussed, and experiments are presented that clearly show the advantage of semi-supervised learning over standard supervised learning. The data used in the study is available to the research community to encourage further investigation of this problem.

1. Introduction

The School of Computer Science at Carnegie Mellon University has a public lounge, where leftover pizza and other food items from various meetings converge, to the delight of students, staff, and faculty. To help monitor the presence of food in the lounge, a webcam, sometimes called the FreeFoodCam¹, is mounted in a coke machine and trained upon the table where food is placed. After being spotted on the webcam, the arrival of (almost) fresh free food is heralded with instant messages sent throughout the School. The FreeFoodCam offers interesting opportunities for research in semi-supervised machine learning. This paper presents an investigation of the problem of person identification in this low quality video data, using webcam images of ten people that were collected over a period of several months. The results highlight the importance of domain knowledge in semi-supervised learning, and clearly demonstrate the advantages of using both labeled and unlabeled data over standard supervised learning.

¹ u.edu/~coke, Carnegie Mellon University internal

Appearing in Proc. of the 22nd ICML Workshop on Learning with Partially Classified Training Data, Bonn, Germany, 2005. Copyright 2005 by the author(s)/owner(s).

In recent years, there has been a substantial amount of work exploring how best to incorporate unlabeled data into supervised learning (Zhu, 2005). Several semi-supervised learning approaches have been proposed for practical applications in different areas, such as information retrieval, text classification (Nigam et al., 1998), and bioinformatics (Weston et al., 2004; Shin et al., 2004). In the context of computer vision, several interesting results have been obtained for object detection. Levin et al. (2003) introduced a technique based on co-training (Blum & Mitchell, 1998) for fitting visual detectors in a way that requires only a small quantity of labeled data, using unlabeled data to improve performance over time. Rosenberg et al. (2005) present a semi-supervised approach to training object detection systems based on self-training, and perform extensive experiments with a state-of-the-art detector (Schneiderman & Kanade, 2002; Schneiderman, 2004a; Schneiderman, 2004b) demonstrating that a model trained in this manner can achieve results comparable to a model trained in the traditional manner using a much larger set of fully labeled data.

In this work, we describe a new application of semi-supervised learning to the problem of person identification in webcam images, where the video stream has a low frame rate, and the images are of low quality. Significantly, many of the images may have no face, as the person could be facing away from the camera. We discuss the creation of the dataset, and the formulation of the semi-supervised learning problem. The task of face recognition, of course, has an extensive literature; see (Zhao et al., 2003) for a survey. However, to the best of our knowledge, person identification in video data has not been previously attacked using semi-supervised learning methods. Relatively primitive image processing techniques are used in our work; we note that more sophisticated computer vision techniques can be easily incorporated into the framework, and should only improve the performance. But the spirit of our contribution is to argue that semi-supervised learning methods may be attractive as a complementary tool to advanced image processing. The data we have developed and that forms the basis for the experiments reported here will be made available to the research community.²

Figure 1. Four typical FreeFoodCam images.

2. The FreeFoodCam Dataset

The dataset consists of 5254 images, each containing one and only one person. Figure 1 shows four typical images from the data. The task is not trivial:

• The images of each person were captured on multiple days during a four month period. People changed clothes, hair styles, and one person even grew a beard. We simulate a video surveillance scenario where images for a group of people are manually labeled in a few beginning frames, and the people must be recognized on later days. Therefore we choose labeled data within the first day of a person’s appearance, and test on the remaining images of the day and all other days. This is much more difficult than testing only on the same day, or allowing labeled data to come from all days.

• The FreeFoodCam is a low quality webcam. Each frame has 640×480 resolution so faces of far away people are small. The frame rate is a little over 0.5 frames per second, and lighting in the lounge is complex and changing.

• A person could turn their face away from the camera, and roughly one third of the images contain no face at all.

Since only a few images are labeled, and all of the test images are available, the task is a natural candidate for the application of semi-supervised learning techniques.

² Instructions for obtaining the dataset can be found at u.edu/~zhuxj/freefoodcam.

subject  10/24  11/13    1/6   1/14   1/20   1/21   1/27  total
1          128      -      -    193      -      -    153    474
2          256      -      -      -    193      -      -    448
3          288      -      -    305      -      -      -    593
4          204      -      -      -      -    190      -    394
5          266     41      -    189      -     19      -    515
6          195     34    179      -      -      -    104    512
7          126    163    200    180     70     22     28    789
8          189     66    172    117      -     15      -    559
9          189     94    215     69      -     30     43    640
10           -      -     65    143    122      -      -    330
total     1841    398    831   1196    384    276    328   5254

Figure 2. Left: mean background image used for background subtraction. Right: breakdown of the 10 subjects by date.

2.1. Data Collection

We asked ten volunteers to appear in seven FreeFoodCam takes over four months. Not all participants could show up for every take. The FreeFoodCam is located in the Computer Science lounge, but we received a live camera feed in our office, and took images from the camera whenever a new frame was available.

In each take, the participants took turns entering the scene, walking around, and “acting naturally,” for example by reading the newspaper or chatting with off-camera colleagues, for five to ten minutes per take. As a result, we collected images where the individuals have varying poses and are at a range of distances from the camera. We discarded all frames that were corrupted by electronic noise in the coke machine, or that contained more than one person in the scene. This latter constraint was imposed to make the task simple to specify as a first step; there is no reason that the methods we present below could not be extended to work with scenes containing multiple people.

2.2. Foreground Color Extraction

To accurately capture the color information of an individual in the image, based primarily on their clothing, we had to separate him or her from the background. As computer vision is not the focus of the work, we used only primitive image processing methods.

A simple background subtraction algorithm was used to find the foreground. We computed the per-pixel means and variances of the red, green and blue channels from 294 background images. Figure 2 shows the mean background. Using the means and variances of the background, we obtained the foreground area in each image by thresholding. Pixels deviating more than three standard deviations from the mean were treated as foreground.
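The thresholding step can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the nested-list image layout, the use of the population variance, the rule that a pixel is foreground when any single channel exceeds the threshold, and the `min_std` floor are all assumptions made for illustration.

```python
import math

def background_stats(frames):
    """Per-pixel, per-channel mean and standard deviation of background frames.

    Each frame is a list of rows; each row is a list of (r, g, b) tuples.
    """
    n = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    mean = [[[0.0] * 3 for _ in range(width)] for _ in range(height)]
    std = [[[0.0] * 3 for _ in range(width)] for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for ch in range(3):
                values = [frame[y][x][ch] for frame in frames]
                m = sum(values) / n
                # Population variance (assumption; the paper does not specify).
                var = sum((v - m) ** 2 for v in values) / n
                mean[y][x][ch] = m
                std[y][x][ch] = math.sqrt(var)
    return mean, std

def foreground_mask(image, mean, std, k=3.0, min_std=1.0):
    """True where any channel deviates by more than k standard deviations.

    min_std is a hypothetical floor that avoids a zero threshold for
    background pixels that never changed.
    """
    mask = []
    for y, row in enumerate(image):
        mask.append([
            any(abs(pixel[ch] - mean[y][x][ch]) > k * max(std[y][x][ch], min_std)
                for ch in range(3))
            for x, pixel in enumerate(row)
        ])
    return mask
```

With `k=3.0` the mask follows the three-standard-deviation rule stated above; how the three color channels are combined is a design choice that the text leaves open.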

To improve the quality of the foreground color histogram, we processed the foreground area using morphological transforms (Jain, 1989). Further processing was required because the foreground derived from background subtraction often captured only part of the body and contained background areas. We first removed small islands in the foreground by applying the open operation with a 7-pixel-wide square. We then connected vertically separated pixel blocks (such as head and lower torso) using the close operation with a 60-pixel-by-10-pixel rectangular block. Finally, we made sure the foreground contains the entire person by enlarging the foreground to include neighboring pixels by further closing the foreground with a disk of 20 pixels in radius. And because there is only one person in each image, we discarded all but the largest contiguous block of pixels in the processed foreground. Figure 3 shows some processed foreground images.
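The open and close operations and the final largest-block selection can be sketched in pure Python as below. This is an illustrative approximation under stated assumptions rather than the procedure actually used: structuring elements are rectangles with odd side lengths (a nominal 60-by-10 block becomes a 61-by-11 window), the disk-shaped element is not reproduced, connectivity is 4-neighbor, and all function names are hypothetical.

```python
from collections import deque

def _scan(mask, kh, kw, combine, outside):
    """Apply `combine` (all or any) over a window centred on each pixel."""
    h, w = len(mask), len(mask[0])
    ry, rx = kh // 2, kw // 2  # window is (2*ry + 1) x (2*rx + 1)

    def at(y, x):
        return mask[y][x] if 0 <= y < h and 0 <= x < w else outside

    return [[combine(at(y + dy, x + dx)
                     for dy in range(-ry, ry + 1)
                     for dx in range(-rx, rx + 1))
             for x in range(w)]
            for y in range(h)]

def erode(mask, kh, kw):
    # Pixels outside the image count as foreground so borders do not erode.
    return _scan(mask, kh, kw, all, True)

def dilate(mask, kh, kw):
    return _scan(mask, kh, kw, any, False)

def opening(mask, kh, kw):
    """Erosion followed by dilation: removes small isolated islands."""
    return dilate(erode(mask, kh, kw), kh, kw)

def closing(mask, kh, kw):
    """Dilation followed by erosion: bridges gaps smaller than the window."""
    return erode(dilate(mask, kh, kw), kh, kw)

def largest_component(mask):
    """Keep only the largest 4-connected foreground region."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for sy in range(h):
        for sx in range(w):
            if not mask[sy][sx] or seen[sy][sx]:
                continue
            seen[sy][sx] = True
            queue, region = deque([(sy, sx)]), []
            while queue:  # breadth-first flood fill
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) > len(best):
                best = region
    out = [[False] * w for _ in range(h)]
    for y, x in best:
        out[y][x] = True
    return out
```

A call such as `largest_component(closing(opening(mask, 7, 7), 61, 11))` mirrors the first two steps and the final selection described above; the nested loops are far too slow for full 640×480 frames and serve only to make the operations explicit.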

After this processing the foreground area is represented by a 100-dimensional vector, which consists of a 50-bin hue histogram, a 30-bin saturation histogram, and a 20-bin brightness histogram.
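As a rough sketch of how such a 100-dimensional feature could be computed, the following uses the standard-library `colorsys` module. The function name, the uniform bin edges, and the normalization of each histogram to sum to one are assumptions; the paper does not state these details.

```python
import colorsys

def hsv_histogram(pixels, bins=(50, 30, 20)):
    """Concatenated hue, saturation and brightness histograms.

    `pixels` is an iterable of (r, g, b) tuples with values in 0..255,
    restricted to the foreground area.
    """
    hists = [[0.0] * b for b in bins]
    count = 0
    for r, g, b in pixels:
        hsv = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        for hist, value, nbins in zip(hists, hsv, bins):
            # Each component lies in [0, 1]; clamp so 1.0 lands in the last bin.
            hist[min(int(value * nbins), nbins - 1)] += 1.0
        count += 1
    if count:
        # Normalise each histogram to sum to one (assumption).
        hists = [[v / count for v in hist] for hist in hists]
    return [v for hist in hists for v in hist]
```

Normalizing makes histograms of foregrounds with different pixel counts comparable; cosine similarity, which is used later for the color edges, is in any case insensitive to a common scale factor.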

2.3. Face Image Extraction

The face of the person is stored as a small image, which is derived from the outputs of a face detector (Schneiderman, 2004a; 2004b). Note that this is not a face recognizer (a face recognizer was not used for this task). It simply detects the presence of frontal or profile faces, and outputs the estimated center and radius of the detected face. We took a square area around the center as the face image. If no face was detected, the face image is empty. Figure 4 shows a few face images as determined by the face detector.
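The cropping step might look like the sketch below, which assumes the detector output is already available as a center and a radius. The function name, the (row, column) convention, the side length of 2 × radius + 1 pixels, and the clipping at the image border are illustrative assumptions.

```python
def crop_face(image, center, radius):
    """Return the square patch around a detected face, or None if no face.

    `image` is a list of pixel rows; `center` is (row, col) or None when the
    detector found nothing.
    """
    if center is None:
        return None  # the face feature is missing for this image
    cy, cx = center
    height, width = len(image), len(image[0])
    # Clip the square to the image bounds.
    top, bottom = max(cy - radius, 0), min(cy + radius + 1, height)
    left, right = max(cx - radius, 0), min(cx + radius + 1, width)
    return [row[left:right] for row in image[top:bottom]]
```

Returning `None` for undetected faces keeps the "missing feature" case explicit for the later stages that must handle it.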

2.4. Summary of the Dataset

In summary, the dataset is comprised of 5254 images for ten individuals, collected during seven takes over four months. There is a slight imbalance in the class distribution, and only a subset of individuals are present in each day (refer to the table in Figure 2 for the breakdown). Overall 34% of the images (1808 out of 5254) do not contain a face.

Each image in the dataset is represented by three features:

Time: The date and time the image was taken.

Color histogram of processed foreground: A 100-dimensional vector consisting of three histograms of the foreground pixels: a 50-bin hue histogram, a 30-bin saturation histogram, and a 20-bin brightness histogram.

Face image: A square color image of the face (if present). As mentioned above, this feature is missing in about 34% of the images.

Figure 3. Examples of foregrounds extracted by background subtraction and morphological transforms.

Figure 4. Examples of face images detected by the face detector.

3. The Graphs

Graph-based semi-supervised learning depends critically on the construction and quality of the graph. The graph should reflect domain knowledge through the similarity function that is used to assign edges (and their weights). For the FreeFoodCam data the nodes in the graph are the images. An edge is formed between two images according to the following criteria:

1. Time edges. People normally move around in the lounge at moderate speed, thus adjacent frames are likely to contain the same person. We represent this knowledge in the graph by putting an edge between two images if their time difference is less than a threshold $t_1$ (usually a few seconds).

2. Color edges. The color histogram is largely determined by a person’s apparel. We assume people change clothes on different days, so that the color histogram tends to be unusable across multiple days. However, it is an informative feature during a shorter time period ($t_2$), such as half a day. In the graph, for every image $i$, we find the set of images having a time difference between $(t_1, t_2)$ to $i$, and connect $i$ with its $k_c$-nearest neighbors (in terms of cosine similarity on histograms) in the set. The parameter $k_c$ is a small integer, such as three.

3. Face edges. We use face similarity over longer time spans. For every image $i$ with a face, we find the set of images more than $t_2$ apart from $i$, and connect $i$ with its $k_f$-nearest neighbors in the set. We use pixel-wise Euclidean distance between face images, where the pair of face images is scaled to the same size.

The final graph is the union of the three kinds of edges. The edges are unweighted. We used $t_1 = 2$ seconds, $t_2 = 12$ hours, $k_c = 3$ and $k_f = 1$ below. Conveniently, these parameters result in a connected graph.

It is impossible to visualize the whole graph. Instead, we show the neighbors of a random node in Figure 5.

Figure 5. A random image and its neighbors in the graph. Image 2910 is shown with neighbor 1 (time edge), neighbors 2 to 4 (color edges), and neighbor 5 (face edge).
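The three edge criteria can be summarized in a short sketch. It is a simplified illustration, not the authors' code: the function and argument names are hypothetical, face images are assumed to be already scaled to a common size and flattened, face-edge candidates are restricted to images that contain a face, and the interval boundaries at $t_1$ and $t_2$ are treated as inclusive for color edges.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_graph(times, colors, faces, t1=2.0, t2=12 * 3600.0, kc=3, kf=1):
    """Return the set of undirected, unweighted edges (i, j) with i < j.

    times  : timestamps in seconds
    colors : colour histograms (lists of floats)
    faces  : flattened, equally sized face images, or None when missing
    """
    n = len(times)
    edges = set()

    def connect(i, j):
        edges.add((min(i, j), max(i, j)))

    for i in range(n):
        gaps = [abs(times[i] - times[j]) for j in range(n)]
        # 1. Time edges: frames less than t1 apart.
        for j in range(n):
            if j != i and gaps[j] < t1:
                connect(i, j)
        # 2. Colour edges: the kc most similar histograms with gap in [t1, t2].
        near = [j for j in range(n) if j != i and t1 <= gaps[j] <= t2]
        near.sort(key=lambda j: cosine_similarity(colors[i], colors[j]), reverse=True)
        for j in near[:kc]:
            connect(i, j)
        # 3. Face edges: the kf closest faces among images more than t2 away.
        if faces[i] is not None:
            far = [j for j in range(n) if gaps[j] > t2 and faces[j] is not None]
            far.sort(key=lambda j: euclidean(faces[i], faces[j]))
            for j in far[:kf]:
                connect(i, j)
    return edges
```

Storing each edge once as an ordered pair makes the result the union of the three edge types, with the neighbor relation symmetric even though nearest-neighbor selection itself is not.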

4. Algorithms

We use the simple Gaussian field and harmonic function algorithm (Zhu et al., 2003) on the FreeFoodCam dataset. Let $l$ be the number of labeled images, $u$ the number of unlabeled images, and $n = l + u$. The graph is represented by the $n \times n$ weight matrix $W$. Let $D$ be the diagonal degree matrix with $D_{ii} = \sum_j W_{ij}$, and define the combinatorial Laplacian

$$L = D - W \tag{1}$$

Let $Y_l$ be an $l \times C$ label matrix, where $C = 10$ is the number of classes. For $i = 1, \ldots, l$, $Y_l(i, c) = 1$ if labeled image $i$ is in class $c$, and $Y_l(i, c) = 0$ otherwise. Then the harmonic function solution for the unlabeled data is

$$Y_u = -L_{uu}^{-1} L_{ul} Y_l \tag{2}$$

where $L_{uu}$ is the submatrix of $L$ on unlabeled nodes and so on. Each row of $Y_u$ can be interpreted as the collection of posterior probabilities $p(y_i = c \mid Y_l)$ for $c = 1, \ldots, C$ and $i \in U$. Classification is carried out by finding the class with the maximal posterior in each row.
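For a small graph, Eq. (2) can be evaluated directly. The sketch below solves the equivalent linear system $L_{uu} Y_u = -L_{ul} Y_l$ by Gauss-Jordan elimination in pure Python. The function name and the dictionary-based label encoding are illustrative assumptions, and the code presumes that every unlabeled node is connected to at least one labeled node so that $L_{uu}$ is invertible.

```python
def harmonic_solution(W, labels, num_classes):
    """Solve L_uu Y_u = -L_ul Y_l (Eq. 2) for the unlabeled nodes.

    W      : symmetric n x n weight matrix (list of lists)
    labels : dict {node index: class index} for the labeled nodes
    Returns (unlabeled node indices, Y_u as a list of rows).
    """
    n = len(W)
    labeled = sorted(labels)
    unlabeled = [i for i in range(n) if i not in labels]
    degree = [sum(row) for row in W]

    def laplacian(i, j):  # L = D - W, Eq. (1)
        return (degree[i] if i == j else 0.0) - W[i][j]

    # Augmented system [L_uu | -L_ul Y_l]. Because Y_l is one-hot, entry
    # (i, c) of -L_ul Y_l is the total weight from i to labeled nodes of class c.
    rows = []
    for i in unlabeled:
        rhs = [0.0] * num_classes
        for j in labeled:
            rhs[labels[j]] -= laplacian(i, j)
        rows.append([laplacian(i, j) for j in unlabeled] + rhs)

    u = len(unlabeled)
    for col in range(u):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, u), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        head = rows[col][col]
        rows[col] = [v / head for v in rows[col]]
        for r in range(u):
            if r != col and rows[r][col] != 0.0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return unlabeled, [row[u:] for row in rows]
```

On the chain graph 0-1-2-3 with node 0 labeled as class 0 and node 3 as class 1, the two interior nodes receive the rows (2/3, 1/3) and (1/3, 2/3), so each score is the average of its neighbors' scores. For a graph with thousands of nodes a sparse linear solver would be the natural choice instead of dense elimination.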

In (Zhu et al., 2003) it has also been shown that incorporating class proportion knowledge can be helpful. The proportion $q_c$ of data with label $c$ can be estimated from the labeled set. In particular, the class mass normalization (CMN) heuristic scales the posteriors to meet the proportions. That is, one finds a set of coefficients $a_1, \ldots, a_C$ such that

$$a_1 \sum_{i \in U} Y_u(i, 1) : \cdots : a_C \sum_{i \in U} Y_u(i, C) = q_1 : \cdots : q_C \tag{3}$$
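One way to satisfy Eq. (3) is to set each coefficient to the desired proportion divided by the current class mass, and then to label each unlabeled point by $\arg\max_c a_c Y_u(i, c)$. The sketch below does this; the function names are hypothetical, and because Eq. (3) fixes the coefficients only up to a common factor, the particular scaling chosen here is one valid option among many.

```python
def cmn_coefficients(Y_u, q):
    """Coefficients a_c such that a_c * sum_i Y_u(i, c) equals q_c (Eq. 3)."""
    masses = [sum(row[c] for row in Y_u) for c in range(len(q))]
    # A class with zero mass cannot be rescaled; give it coefficient 0.
    return [q[c] / masses[c] if masses[c] > 0 else 0.0 for c in range(len(q))]

def cmn_classify(Y_u, q):
    """Predict argmax_c a_c * Y_u(i, c) for every row of Y_u."""
    a = cmn_coefficients(Y_u, q)
    return [max(range(len(q)), key=lambda c: a[c] * row[c]) for row in Y_u]
```

For example, with rows (0.6, 0.4), (0.7, 0.3), (0.55, 0.45) and equal proportions q = (0.5, 0.5), the plain maximum assigns every point to the first class, whereas the rescaled scores assign the first and third points to the second class.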

Classification of an unlabeled point $i$ is achieved by finding $\arg\max_c a_c Y_u(i, c)$. In the experiments below we report the accuracy of both the harmonic function and CMN.

Figure 6. An example “gradient walk” on the graph. The walk starts from an unlabeled image, passes through assorted edges (face, time, and color edges in this example), and ends at a labeled image.

4.1. Gradient Walks on the Graph

The harmonic algorithm described above solves a set of linear equations so that the predicted label of each example is the average of the predicted labels of its unlabeled neighbors and the actual labels of its labeled neighbors. The “reasons” for the algorithm’s predictions can (roughly) be visualized by performing a “gradient walk” starting from an unlabeled example $i$, always moving to the neighbor with the highest score given to the predicted label. That is, let $y$ be the predicted label for $i$. If we are at node $j$, we will walk to $j$’s neighbor node $k$ if

$$k = \arg\max_{k' \sim j} Y_u(k', y) \tag{4}$$

The gradient walk continues until we reach a labeled example. Two gradient walk paths are shown in Figure 6 and Figure 7.
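The walk defined by Eq. (4) is straightforward to express in code. The following is a minimal sketch with hypothetical names; it assumes that scores are available for every node (one-hot labels for labeled nodes, rows of $Y_u$ for unlabeled ones) and adds a `max_steps` guard, which is not part of the description above, to stop if the walk stalls on a plateau of equal scores.

```python
def gradient_walk(start, predicted, neighbors, scores, labeled, max_steps=1000):
    """Follow Eq. (4) from `start` until a labeled node is reached.

    predicted : class index y predicted for the start node
    neighbors : dict {node: list of adjacent nodes}
    scores    : dict {node: list of per-class scores}
    labeled   : set of labeled nodes
    Returns the list of visited nodes, beginning with `start`.
    """
    path = [start]
    node = start
    for _ in range(max_steps):
        if node in labeled:
            break
        # Move to the neighbour with the highest score for the predicted class.
        node = max(neighbors[node], key=lambda k: scores[k][predicted])
        path.append(node)
    return path
```

On a seven-node chain whose end points are labeled with different classes, a walk started at node 2 with the class of node 0 visits nodes 2, 1, 0 in turn.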

5. Experimental Results

We evaluated harmonic functions on the FreeFoodCam tasks. For each task we gradually increased the labeled set size, and performed 30 random trials for each labeled set size. In each trial we randomly sampled a labeled set of the specified size from the first day of a person’s appearance only. This is because we wanted to simulate a video surveillance scenario, where people are tagged and then identified on later days. It is more difficult and more realistic than sampling labeled data from the entire dataset. If a class was missing from the sampled labeled set, we redid the random sampling. The remaining images are used as the unlabeled set.
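The sampling protocol can be sketched as follows. This reflects one reading of the description, in which the candidate pool is the union of every person's first-day images; the function name, the argument layout, and the `max_tries` guard are assumptions introduced for illustration.

```python
import random

def sample_labeled_set(days, classes, size, rng=None, max_tries=10000):
    """Sample `size` labeled images from each person's first day only.

    days    : day identifier per image (comparable, e.g. ordinal dates)
    classes : person identifier per image
    Resamples until every person is represented in the labeled set.
    Returns (sorted labeled indices, unlabeled indices).
    """
    rng = rng or random.Random()
    first_day = {}
    for day, person in zip(days, classes):
        first_day[person] = min(day, first_day.get(person, day))
    # Candidate pool: images taken on the first day their subject appears.
    pool = [i for i, (day, person) in enumerate(zip(days, classes))
            if day == first_day[person]]
    for _ in range(max_tries):
        labeled = set(rng.sample(pool, size))
        if {classes[i] for i in labeled} == set(first_day):
            unlabeled = [i for i in range(len(days)) if i not in labeled]
            return sorted(labeled), unlabeled
    raise ValueError("no sample covered every class; try a larger size")
```

Calling the function 30 times per labeled set size, each time with a differently seeded `random.Random`, would reproduce the repeated-trial structure described above.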

We report the classification accuracies with harmonic functions and CMN, on two different graphs. The first graph is constructed with parameters $t_1 = 2$ seconds, $t_2 = 12$ hours, $k_c = 3$, $k_f = 1$; the second with $k_c = 1$. The results are presented in Figure 8.

To compare the graph-based semi-supervised learning methods against a standard supervised learning method, we used a Matlab implementation of support vector machines (Gunn, 1997) as the baseline. For $C$-class multiclass problems, we used a one-against-all scheme which creates $C$ binary subproblems, one for each class against all the other classes, and selects the class with the largest margin. Because we have missing features on face sub-images, the kernel for the SVM baseline requires special care. We used an interpolated linear kernel $K(i, j) = w_t K_t(i, j) + w_c K_c(i, j) + w_f K_f(i, j)$, where $K_t$, $K_c$, $K_f$ are linear kernels (inner products) on time stamp, color histogram, and face sub-image (normalized to 50×50 pixels) respectively. If image $i$ contains no face, we define $K_f(i, \cdot) = 0$. The interpolation weights $w_t$, $w_c$, $w_f$ were optimized with cross validation. Notice the SVMs with such a kernel are not semi-supervised: the unlabeled data are merely used as test data. We found that the harmonic
