Accepted, Conference on Computer Vision and Pattern Recognition 2001

Rapid Object Detection using a Boosted Cascade of Simple Features
Paul Viola
Mitsubishi Electric Research Labs
201 Broadway, 8th FL
Cambridge, MA 02139

Michael Jones
mjones@
Compaq CRL
One Cambridge Center
Cambridge, MA 02142
Abstract
This paper describes a machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates. This work is distinguished by three key contributions. The first is the introduction of a new image representation called the "Integral Image" which allows the features used by our detector to be computed very quickly. The second is a learning algorithm, based on AdaBoost, which selects a small number of critical visual features from a larger set and yields extremely efficient classifiers [6]. The third contribution is a method for combining increasingly more complex classifiers in a "cascade" which allows background regions of the image to be quickly discarded while spending more computation on promising object-like regions. The cascade can be viewed as an object specific focus-of-attention mechanism which unlike previous approaches provides statistical guarantees that discarded regions are unlikely to contain the object of interest. In the domain of face detection the system yields detection rates comparable to the best previous systems. Used in real-time applications, the detector runs at 15 frames per second without resorting to image differencing or skin color detection.
1. Introduction
This paper brings together new algorithms and insights to construct a framework for robust and extremely rapid object detection. This framework is demonstrated on, and in part motivated by, the task of face detection. Toward this end we have constructed a frontal face detection system which achieves detection and false positive rates which are equivalent to the best published results [16, 12, 15, 11, 1]. This face detection system is most clearly distinguished from previous approaches in its ability to detect faces extremely rapidly. Operating on 384 by 288 pixel images, faces are detected at 15 frames per second on a conventional 700 MHz Intel Pentium III. In other face detection systems, auxiliary information, such as image differences in video sequences, or pixel color in color images, have been used to achieve high frame rates. Our system achieves high frame rates working only with the information present in a single grey scale image. These alternative sources of information can also be integrated with our system to achieve even higher frame rates.
There are three main contributions of our object detection framework. We will introduce each of these ideas briefly below and then describe them in detail in subsequent sections.
The first contribution of this paper is a new image representation called an integral image that allows for very fast feature evaluation. Motivated in part by the work of Papageorgiou et al. our detection system does not work directly with image intensities [10]. Like these authors we use a set of features which are reminiscent of Haar basis functions (though we will also use related filters which are more complex than Haar filters). In order to compute these features very rapidly at many scales we introduce the integral image representation for images. The integral image can be computed from an image using a few operations per pixel. Once computed, any one of these Haar-like features can be computed at any scale or location in constant time.
The second contribution of this paper is a method for constructing a classifier by selecting a small number of important features using AdaBoost [6]. Within any image sub-window the total number of Haar-like features is very large, far larger than the number of pixels. In order to ensure fast classification, the learning process must exclude a large majority of the available features, and focus on a small set of critical features. Motivated by the work of Tieu and Viola, feature selection is achieved through a simple modification of the AdaBoost procedure: the weak learner is constrained so that each weak classifier returned can depend on only a single feature [2]. As a result each stage of the boosting process, which selects a new weak classifier, can be viewed as a feature selection process. AdaBoost provides an effective learning algorithm and strong bounds on generalization performance [13, 9, 10].
The third major contribution of this paper is a method for combining successively more complex classifiers in a cascade structure which dramatically increases the speed of the detector by focusing attention on promising regions of the image. The notion behind focus of attention approaches is that it is often possible to rapidly determine where in an image an object might occur [17, 8, 1]. More complex processing is reserved only for these promising regions. The key measure of such an approach is the "false negative" rate of the attentional process. It must be the case that all, or almost all, object instances are selected by the attentional filter.
We will describe a process for training an extremely simple and efficient classifier which can be used as a "supervised" focus of attention operator. The term supervised refers to the fact that the attentional operator is trained to detect examples of a particular class. In the domain of face detection it is possible to achieve fewer than 1% false negatives and 40% false positives using a classifier constructed from two Haar-like features. The effect of this filter is to reduce by over one half the number of locations where the final detector must be evaluated.
Those sub-windows which are not rejected by the initial classifier are processed by a sequence of classifiers, each slightly more complex than the last. If any classifier rejects the sub-window, no further processing is performed. The structure of the cascaded detection process is essentially that of a degenerate decision tree, and as such is related to the work of Geman and colleagues [1, 4].
An extremely fast face detector will have broad practical applications. These include user interfaces, image databases, and teleconferencing. In applications where rapid frame-rates are not necessary, our system will allow for significant additional post-processing and analysis. In addition our system can be implemented on a wide range of small low power devices, including hand-helds and embedded processors. In our lab we have implemented this face detector on the Compaq iPaq handheld and have achieved detection at two frames per second (this device has a low power 200 MIPS StrongARM processor which lacks floating point hardware).
The remainder of the paper describes our contributions and a number of experimental results, including a detailed description of our experimental methodology. Discussion of closely related work takes place at the end of each section.
2. Features
Our object detection procedure classifies images based on the value of simple features. There are many motivations
Figure 1: Example rectangle features shown relative to the enclosing detection window. The sum of the pixels which lie within the white rectangles is subtracted from the sum of pixels in the grey rectangles. Two-rectangle features are shown in (A) and (B). Figure (C) shows a three-rectangle feature, and (D) a four-rectangle feature.
for using features rather than the pixels directly. The most common reason is that features can act to encode ad-hoc domain knowledge that is difficult to learn using a finite quantity of training data. For this system there is also a second critical motivation for features: the feature-based system operates much faster than a pixel-based system.
The simple features used are reminiscent of Haar basis functions which have been used by Papageorgiou et al. [10]. More specifically, we use three kinds of features. The value of a two-rectangle feature is the difference between the sums of the pixels within two rectangular regions. The regions have the same size and shape and are horizontally or vertically adjacent (see Figure 1). A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a center rectangle. Finally a four-rectangle feature computes the difference between diagonal pairs of rectangles.
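As a concrete and deliberately naive illustration of these feature types, a two-rectangle feature can be computed directly from pixel sums. The helper names and the [row][column] indexing convention below are our choices, not from the paper:

```python
def region_sum(img, top, left, h, w):
    """Sum of the pixels in the h-by-w rectangle whose top-left corner is (top, left)."""
    return sum(img[y][x] for y in range(top, top + h) for x in range(left, left + w))

def two_rect_feature(img, top, left, h, w):
    """Two horizontally adjacent rectangles of identical size:
    the sum in one region minus the sum in the other."""
    white = region_sum(img, top, left, h, w)
    grey = region_sum(img, top, left + w, h, w)
    return grey - white

# Toy 2x4 image: left half is 1s, right half is 2s.
img = [[1, 1, 2, 2],
       [1, 1, 2, 2]]
value = two_rect_feature(img, 0, 0, 2, 2)  # grey sum 8 minus white sum 4
```

Computed this way each feature costs time proportional to its area; Section 2.1 replaces these per-pixel sums with constant-time lookups.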
Given that the base resolution of the detector is 24x24, the exhaustive set of rectangle features is quite large, over 180,000. Note that unlike the Haar basis, the set of rectangle features is overcomplete.
2.1. Integral Image
Rectangle features can be computed very rapidly using an intermediate representation for the image which we call the integral image. The integral image at location (x, y) contains the sum of the pixels above and to the left of (x, y), inclusive:

ii(x, y) = Σ_{x' ≤ x, y' ≤ y} i(x', y')
Figure 2: The sum of the pixels within rectangle D can be computed with four array references. The value of the integral image at location 1 is the sum of the pixels in rectangle A. The value at location 2 is A + B, at location 3 is A + C, and at location 4 is A + B + C + D. The sum within D can be computed as 4 + 1 − (2 + 3).
where ii(x, y) is the integral image and i(x, y) is the original image. Using the following pair of recurrences:

s(x, y) = s(x, y − 1) + i(x, y)    (1)
ii(x, y) = ii(x − 1, y) + s(x, y)    (2)

(where s(x, y) is the cumulative row sum, s(x, −1) = 0, and ii(−1, y) = 0) the integral image can be computed in one pass over the original image.
Using the integral image any rectangular sum can be computed in four array references (see Figure 2). Clearly the difference between two rectangular sums can be computed in eight references. Since the two-rectangle features defined above involve adjacent rectangular sums they can be computed in six array references, eight in the case of the three-rectangle features, and nine for the four-rectangle features.
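The one-pass recurrences and the four-reference rectangle sum can be sketched as follows. This is a minimal illustration; the function names and the [row, column] indexing are our choices:

```python
import numpy as np

def integral_image(img):
    """One pass over the image, maintaining the cumulative row sum s:
    s(x, y) = s(x, y-1) + i(x, y) and ii(x, y) = ii(x-1, y) + s(x, y)."""
    img = np.asarray(img, dtype=np.int64)
    ii = np.zeros_like(img)
    s = np.zeros_like(img)
    h, w = img.shape
    for y in range(h):
        for x in range(w):
            s[y, x] = (s[y - 1, x] if y > 0 else 0) + img[y, x]
            ii[y, x] = (ii[y, x - 1] if x > 0 else 0) + s[y, x]
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of pixels in a rectangle using exactly four array references,
    as in Figure 2: 4 + 1 - (2 + 3). Out-of-range references count as 0."""
    def at(y, x):
        return ii[y, x] if y >= 0 and x >= 0 else 0
    bottom, right = top + height - 1, left + width - 1
    return (at(bottom, right) - at(top - 1, right)
            - at(bottom, left - 1) + at(top - 1, left - 1))
```

A two-rectangle feature is then the difference of two such `rect_sum` calls, which share two corners, giving the six-reference cost noted above.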
2.2. Feature Discussion
Rectangle features are somewhat primitive when compared with alternatives such as steerable filters [5, 7]. Steerable filters, and their relatives, are excellent for the detailed analysis of boundaries, image compression, and texture analysis. In contrast rectangle features, while sensitive to the presence of edges, bars, and other simple image structure, are quite coarse. Unlike steerable filters the only orientations available are vertical, horizontal, and diagonal. The set of rectangle features does however provide a rich image representation which supports effective learning. In conjunction with the integral image, the efficiency of the rectangle feature set provides ample compensation for their limited flexibility.
3. Learning Classification Functions

Given a feature set and a training set of positive and negative images, any number of machine learning approaches could be used to learn a classification function. In our system a variant of AdaBoost is used both to select a small set of features and train the classifier [6]. In its original form, the AdaBoost learning algorithm is used to boost the classification performance of a simple (sometimes called weak) learning algorithm. There are a number of formal guarantees provided by the AdaBoost learning procedure. Freund and Schapire proved that the training error of the strong classifier approaches zero exponentially in the number of rounds. More importantly a number of results were later proved about generalization performance [14]. The key insight is that generalization performance is related to the margin of the examples, and that AdaBoost achieves large margins rapidly.
Recall that there are over 180,000 rectangle features associated with each image sub-window, a number far larger than the number of pixels. Even though each feature can be computed very efficiently, computing the complete set is prohibitively expensive. Our hypothesis, which is borne out by experiment, is that a very small number of these features can be combined to form an effective classifier. The main challenge is to find these features.
In support of this goal, the weak learning algorithm is designed to select the single rectangle feature which best separates the positive and negative examples (this is similar to the approach of [2] in the domain of image database retrieval). For each feature, the weak learner determines the optimal threshold classification function, such that the minimum number of examples are misclassified. A weak classifier h_j(x) thus consists of a feature f_j, a threshold θ_j and a parity p_j indicating the direction of the inequality sign:

h_j(x) = 1 if p_j f_j(x) < p_j θ_j, and 0 otherwise.
Here x is a 24x24 pixel sub-window of an image. See Table 1 for a summary of the boosting process.
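A sketch of this thresholded weak classifier, together with a brute-force search for the optimal threshold and parity, might look like the following. The helper names are ours, and a practical implementation would search thresholds more efficiently than this exhaustive loop:

```python
def weak_classify(feature_value, theta, parity):
    """h(x) = 1 if parity * f(x) < parity * theta, else 0.
    parity in {+1, -1} sets the direction of the inequality."""
    return 1 if parity * feature_value < parity * theta else 0

def best_threshold(values, labels):
    """Exhaustively try candidate thresholds and both parities, returning the
    (theta, parity, errors) triple that misclassifies the fewest examples."""
    best = (None, None, len(values) + 1)
    candidates = sorted(set(values)) + [max(values) + 1]
    for theta in candidates:
        for parity in (+1, -1):
            errors = sum(weak_classify(v, theta, parity) != y
                         for v, y in zip(values, labels))
            if errors < best[2]:
                best = (theta, parity, errors)
    return best

# Toy data: negatives have small feature values, positives large.
theta, parity, errors = best_threshold([1, 2, 8, 9], [0, 0, 1, 1])
```

On this toy data the search finds a perfect separator (parity −1, i.e. "predict positive when the feature value exceeds the threshold").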
In practice no single feature can perform the classification task with low error. Features which are selected in early rounds of the boosting process had error rates between 0.1 and 0.3. Features selected in later rounds, as the task becomes more difficult, yield error rates between 0.4 and 0.5.
3.1. Learning Discussion
Many general feature selection procedures have been proposed (see chapter 8 of [18] for a review). Our final application demanded a very aggressive approach which would discard the vast majority of features. For a similar recognition problem Papageorgiou et al. proposed a scheme for feature selection based on feature variance [10]. They demonstrated good results selecting 37 features out of a total of 1734 features.
Roth et al. propose a feature selection process based on the Winnow exponential perceptron learning rule [11]. The Winnow learning process converges to a solution where many of the weights are zero. Nevertheless a very large
Given example images (x_1, y_1), ..., (x_n, y_n) where y_i = 0, 1 for negative and positive examples respectively.

Initialize weights w_{1,i} = 1/(2m), 1/(2l) for y_i = 0, 1 respectively, where m and l are the number of negatives and positives respectively.

For t = 1, ..., T:

1. Normalize the weights, w_{t,i} ← w_{t,i} / Σ_j w_{t,j}, so that w_t is a probability distribution.

2. For each feature j, train a classifier h_j which is restricted to using a single feature. The error is evaluated with respect to w_t: ε_j = Σ_i w_i |h_j(x_i) − y_i|.

3. Choose the classifier h_t with the lowest error ε_t.

4. Update the weights: w_{t+1,i} = w_{t,i} β_t^{1−e_i}, where e_i = 0 if example x_i is classified correctly, e_i = 1 otherwise, and β_t = ε_t / (1 − ε_t).

The final strong classifier is:

h(x) = 1 if Σ_{t=1}^T α_t h_t(x) ≥ (1/2) Σ_{t=1}^T α_t, and 0 otherwise, where α_t = log(1/β_t).

Table 1: The AdaBoost algorithm for classifier learning. Each round of boosting selects one feature from the 180,000 potential features.
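Table 1 can be rendered as a small, illustrative training loop. This is a toy sketch (the exhaustive threshold search, the guards around zero error, and all helper names are ours), not the paper's optimized implementation:

```python
import math

def train_adaboost(feature_matrix, labels, rounds):
    """feature_matrix[i][j]: value of feature j on example i; labels in {0, 1}.
    Each round selects the single feature/threshold/parity with lowest weighted error."""
    m = labels.count(0)  # number of negatives
    l = labels.count(1)  # number of positives
    w = [1.0 / (2 * m) if y == 0 else 1.0 / (2 * l) for y in labels]
    chosen = []  # list of (feature index, theta, parity, alpha)
    for _ in range(rounds):
        total = sum(w)
        w = [wi / total for wi in w]  # step 1: normalize the weights
        best = None
        for j in range(len(feature_matrix[0])):
            vals = [row[j] for row in feature_matrix]
            for theta in set(vals):
                for parity in (+1, -1):
                    preds = [1 if parity * v < parity * theta else 0 for v in vals]
                    err = sum(wi * abs(p - y) for wi, p, y in zip(w, preds, labels))
                    if best is None or err < best[0]:  # steps 2-3: lowest weighted error
                        best = (err, j, theta, parity, preds)
        err, j, theta, parity, preds = best
        beta = err / (1 - err) if err < 1 else 1.0
        # step 4: down-weight correctly classified examples by beta
        w = [wi * (beta if p == y else 1.0) for wi, p, y in zip(w, preds, labels)]
        alpha = math.log(1 / beta) if beta > 0 else 1e9  # guard against zero error
        chosen.append((j, theta, parity, alpha))
    return chosen

def strong_classify(chosen, features):
    """Weighted majority vote of the selected weak classifiers."""
    score = sum(a for (j, t, p, a) in chosen if p * features[j] < p * t)
    return 1 if score >= 0.5 * sum(a for (_, _, _, a) in chosen) else 0
```

On separable toy data a single round suffices; the per-round cost is dominated by the weak-classifier search over all features, which is why feature selection and classification share one loop.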
number of features are retained(perhaps a few hundred or thousand).
3.2. Learning Results
While details on the training and performance of the final system are presented in Section 5, several simple results merit discussion. Initial experiments demonstrated that a frontal face classifier constructed from 200 features yields a detection rate of 95% with a false positive rate of 1 in 14084. These results are compelling, but not sufficient for many real-world tasks. In terms of computation, this classifier is probably faster than any other published system, requiring 0.7 seconds to scan a 384 by 288 pixel image. Unfortunately, the most straightforward technique for improving detection performance, adding features to the classifier, directly increases computation time.
For the task of face detection, the initial rectangle features selected by AdaBoost are meaningful and easily interpreted. The first feature selected seems to focus on the property that the region of the eyes is often darker than the region of the nose and cheeks (see Figure 3). This feature is relatively large in comparison with the detection sub-window, and should be somewhat insensitive to the size and location of the face. The second feature selected relies on the property that the eyes are darker than the bridge of the nose.

Figure 3: The first and second features selected by AdaBoost. The two features are shown in the top row and then overlayed on a typical training face in the bottom row. The first feature measures the difference in intensity between the region of the eyes and a region across the upper cheeks. The feature capitalizes on the observation that the eye region is often darker than the cheeks. The second feature compares the intensities in the eye regions to the intensity across the bridge of the nose.
4. The Attentional Cascade
This section describes an algorithm for constructing a cascade of classifiers which achieves increased detection performance while radically reducing computation time. The key insight is that smaller, and therefore more efficient, boosted classifiers can be constructed which reject many of the negative sub-windows while detecting almost all positive instances (i.e. the threshold of a boosted classifier can be adjusted so that the false negative rate is close to zero). Simpler classifiers are used to reject the majority of sub-windows before more complex classifiers are called upon to achieve low false positive rates.
The overall form of the detection process is that of a degenerate decision tree, what we call a "cascade" (see Figure 4). A positive result from the first classifier triggers the evaluation of a second classifier which has also been adjusted to achieve very high detection rates. A positive result from the second classifier triggers a third classifier, and so on. A negative outcome at any point leads to the immediate rejection of the sub-window.
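This degenerate-decision-tree evaluation amounts to a short-circuiting loop over the stages. A minimal sketch, where the representation of a stage as a callable returning pass/reject is our assumption:

```python
def cascade_classify(stages, sub_window):
    """Evaluate stages in order; any rejection stops processing immediately.
    Only sub-windows that pass every stage are reported as detections."""
    for stage in stages:
        if not stage(sub_window):
            return False  # negative outcome: immediate rejection
    return True  # survived every stage

# Toy usage: each "stage" thresholds a single summary value of the window,
# with later stages stricter than earlier ones.
stages = [lambda w: w > 0, lambda w: w > 5, lambda w: w > 7]
```

The speed of the cascade comes from the early `return False`: most sub-windows are negative and exit after the first cheap stage, so the average cost per window stays far below the cost of the full classifier chain.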
Stages in the cascade are constructed by training classifiers using AdaBoost and then adjusting the threshold to minimize false negatives. Note that the default AdaBoost threshold is designed to yield a low error rate on the training data. In general a lower threshold yields higher detection rates and higher false positive rates.

Figure 4: Schematic depiction of the detection cascade. A series of classifiers is applied to every sub-window. The initial classifier eliminates a large number of negative examples with very little processing. Subsequent layers eliminate additional negatives but require additional computation. After several stages of processing the number of sub-windows has been reduced radically. Further processing can take any form such as additional stages of the cascade (as in our detection system) or an alternative detection system.
For example an excellent first stage classifier can be constructed from a two-feature strong classifier by reducing the threshold to minimize false negatives. Measured against a validation training set, the threshold can be adjusted to detect 100% of the faces with a false positive rate of 40%. See Figure 3 for a description of the two features used in this classifier.
Computation of the two feature classifier amounts to about 60 microprocessor instructions. It seems hard to imagine that any simpler filter could achieve higher rejection rates. By comparison, scanning a simple image template, or a single layer perceptron, would require at least 20 times as many operations per sub-window.
The structure of the cascade reflects the fact that within any single image an overwhelming majority of sub-windows are negative. As such, the cascade attempts to reject as many negatives as possible at the earliest stage possible. While a positive instance will trigger the evaluation of every classifier in the cascade, this is an exceedingly rare event.
Much like a decision tree, subsequent classifiers are trained using those examples which pass through all the previous stages. As a result, the second classifier faces a more difficult task than the first. The examples which make it through the first stage are "harder" than typical examples. The more difficult examples faced by deeper classifiers push the entire receiver operating characteristic (ROC) curve downward. At a given detection rate, deeper classifiers have correspondingly higher false positive rates.

4.1. Training a Cascade of Classifiers
The cascade training process involves two types of trade-offs. In most cases classifiers with more features will achieve higher detection rates and lower false positive rates. At the same time classifiers with more features require more time to compute. In principle one could define an optimization framework in which: i) the number of classifier stages, ii) the number of features in each stage, and iii) the threshold of each stage, are traded off in order to minimize the expected number of evaluated features. Unfortunately finding this optimum is a tremendously difficult problem.
In practice a very simple framework is used to produce an effective classifier which is highly efficient. Each stage in the cascade reduces the false positive rate and decreases the detection rate. A target is selected for the minimum reduction in false positives and the maximum decrease in detection. Each stage is trained by adding features until the target detection and false positive rates are met (these rates are determined by testing the detector on a validation set). Stages are added until the overall target for false positive and detection rate is met.
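The stage-wise training loop just described can be sketched as follows, with `train_stage` and `eval_rates` left as placeholders for AdaBoost training and validation-set measurement. All names, parameters, and the toy targets are ours, and the sketch assumes the per-stage targets are always eventually reachable:

```python
def train_cascade(train_stage, eval_rates, max_stage_fpr, min_stage_det, overall_fpr_target):
    """Grow stages until the overall false positive target is met.
    train_stage(stages, n_features) -> a new stage classifier.
    eval_rates(stages) -> (false positive rate, detection rate) on a validation set."""
    stages = []
    overall_fpr = 1.0
    while overall_fpr > overall_fpr_target:
        n_features = 0
        while True:
            # Add features to the current stage until its targets are met.
            n_features += 1
            stage = train_stage(stages, n_features)
            fpr, det = eval_rates(stages + [stage])
            if fpr <= max_stage_fpr * overall_fpr and det >= min_stage_det:
                break
        stages.append(stage)
        overall_fpr = fpr
    return stages
```

With stub trainers for illustration (each stage halving the false positive rate at full detection), the loop simply adds stages until the compounded rate drops below the target.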
4.2. Detector Cascade Discussion
The complete face detection cascade has 38 stages with over 6000 features. Nevertheless the cascade structure results in fast average detection times. On a difficult dataset, containing 507 faces and 75 million sub-windows, faces are detected using an average of 10 feature evaluations per sub-window. In comparison, this system is about 15 times faster than an implementation of the detection system constructed by Rowley et al. [12].
A notion similar to the cascade appears in the face detection system described by Rowley et al. in which two detection networks are used [12]. Rowley et al. used a faster yet less accurate network to prescreen the image in order to find candidate regions for a slower more accurate network. Though it is difficult to determine exactly, it appears that Rowley et al.'s two network face system is the fastest existing face detector.
The structure of the cascaded detection process is essentially that of a degenerate decision tree, and as such is related to the work of Amit and Geman [1]. Unlike techniques which use a fixed detector, Amit and Geman propose an alternative point of view where unusual co-occurrences of simple image features are used to trigger the evaluation of a more complex detection process. In this way the full detection process need not be evaluated at many of the potential image locations and scales. While this basic insight