Accepted, Conference on Computer Vision and Pattern Recognition 2001

Rapid Object Detection using a Boosted Cascade of Simple Features
Paul Viola
Mitsubishi Electric Research Labs
201 Broadway, 8th FL
Cambridge, MA 02139

Michael Jones
mjones@
Compaq CRL
One Cambridge Center
Cambridge, MA 02142
Abstract
This paper describes a machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates. This work is distinguished by three key contributions. The first is the introduction of a new image representation called the "Integral Image" which allows the features used by our detector to be computed very quickly. The second is a learning algorithm, based on AdaBoost, which selects a small number of critical visual features from a larger set and yields extremely efficient classifiers [6]. The third contribution is a method for combining increasingly more complex classifiers in a "cascade" which allows background regions of the image to be quickly discarded while spending more computation on promising object-like regions. The cascade can be viewed as an object specific focus-of-attention mechanism which unlike previous approaches provides statistical guarantees that discarded regions are unlikely to contain the object of interest. In the domain of face detection the system yields detection rates comparable to the best previous systems. Used in real-time applications, the detector runs at 15 frames per second without resorting to image differencing or skin color detection.
1. Introduction
This paper brings together new algorithms and insights to construct a framework for robust and extremely rapid object detection. This framework is demonstrated on, and in part motivated by, the task of face detection. Toward this end we have constructed a frontal face detection system which achieves detection and false positive rates which are equivalent to the best published results [16, 12, 15, 11, 1]. This face detection system is most clearly distinguished from previous approaches in its ability to detect faces extremely rapidly. Operating on 384 by 288 pixel images, faces are detected at 15 frames per second on a conventional 700 MHz Intel Pentium III. In other face detection systems, auxiliary information, such as image differences in video sequences, or pixel color in color images, have been used to achieve high frame rates. Our system achieves high frame rates working only with the information present in a single grey scale image. These alternative sources of information can also be integrated with our system to achieve even higher frame rates.
There are three main contributions of our object detection framework. We will introduce each of these ideas briefly below and then describe them in detail in subsequent sections.
The first contribution of this paper is a new image representation called an integral image that allows for very fast feature evaluation. Motivated in part by the work of Papageorgiou et al. our detection system does not work directly with image intensities [10]. Like these authors we use a set of features which are reminiscent of Haar basis functions (though we will also use related filters which are more complex than Haar filters). In order to compute these features very rapidly at many scales we introduce the integral image representation for images. The integral image can be computed from an image using a few operations per pixel. Once computed, any one of these Haar-like features can be computed at any scale or location in constant time.
The second contribution of this paper is a method for constructing a classifier by selecting a small number of important features using AdaBoost [6]. Within any image sub-window the total number of Haar-like features is very large, far larger than the number of pixels. In order to ensure fast classification, the learning process must exclude a large majority of the available features, and focus on a small set of critical features. Motivated by the work of Tieu and Viola, feature selection is achieved through a simple modification of the AdaBoost procedure: the weak learner is constrained so that each weak classifier returned can depend on only a single feature [2]. As a result each stage of the boosting process, which selects a new weak classifier, can be viewed as a feature selection process. AdaBoost provides an effective learning algorithm and strong bounds on generalization performance [13, 9, 10].
The third major contribution of this paper is a method for combining successively more complex classifiers in a cascade structure which dramatically increases the speed of the detector by focusing attention on promising regions of the image. The notion behind focus of attention approaches is that it is often possible to rapidly determine where in an image an object might occur [17, 8, 1]. More complex processing is reserved only for these promising regions. The key measure of such an approach is the "false negative" rate of the attentional process. It must be the case that all, or almost all, object instances are selected by the attentional filter.
We will describe a process for training an extremely simple and efficient classifier which can be used as a "supervised" focus of attention operator. The term supervised refers to the fact that the attentional operator is trained to detect examples of a particular class. In the domain of face detection it is possible to achieve fewer than 1% false negatives and 40% false positives using a classifier constructed from two Haar-like features. The effect of this filter is to reduce by over one half the number of locations where the final detector must be evaluated.
Those sub-windows which are not rejected by the initial classifier are processed by a sequence of classifiers, each slightly more complex than the last. If any classifier rejects the sub-window, no further processing is performed. The structure of the cascaded detection process is essentially that of a degenerate decision tree, and as such is related to the work of Geman and colleagues [1, 4].
An extremely fast face detector will have broad practical applications. These include user interfaces, image databases, and teleconferencing. In applications where rapid frame-rates are not necessary, our system will allow for significant additional post-processing and analysis. In addition our system can be implemented on a wide range of small low power devices, including hand-helds and embedded processors. In our lab we have implemented this face detector on the Compaq iPaq handheld and have achieved detection at two frames per second (this device has a low power 200 MIPS StrongARM processor which lacks floating point hardware).
The remainder of the paper describes our contributions and a number of experimental results, including a detailed description of our experimental methodology. Discussion of closely related work takes place at the end of each section.
2. Features
Our object detection procedure classifies images based on the value of simple features. There are many motivations
Figure 1: Example rectangle features shown relative to the enclosing detection window. The sum of the pixels which lie within the white rectangles is subtracted from the sum of pixels in the grey rectangles. Two-rectangle features are shown in (A) and (B). Figure (C) shows a three-rectangle feature, and (D) a four-rectangle feature.
for using features rather than the pixels directly. The most common reason is that features can act to encode ad-hoc domain knowledge that is difficult to learn using a finite quantity of training data. For this system there is also a second critical motivation for features: the feature-based system operates much faster than a pixel-based system.
The simple features used are reminiscent of Haar basis functions which have been used by Papageorgiou et al. [10]. More specifically, we use three kinds of features. The value of a two-rectangle feature is the difference between the sums of the pixels within two rectangular regions. The regions have the same size and shape and are horizontally or vertically adjacent (see Figure 1). A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a center rectangle. Finally a four-rectangle feature computes the difference between diagonal pairs of rectangles.
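As a concrete and deliberately naive illustration of these feature types, a two-rectangle feature can be computed directly from pixel sums. The helper names and the [row][column] indexing convention below are our choices, not from the paper:

```python
def region_sum(img, top, left, h, w):
    """Sum of the pixels in the h-by-w rectangle whose top-left corner is (top, left)."""
    return sum(img[y][x] for y in range(top, top + h) for x in range(left, left + w))

def two_rect_feature(img, top, left, h, w):
    """Two horizontally adjacent rectangles of identical size:
    the sum in one region minus the sum in the other."""
    white = region_sum(img, top, left, h, w)
    grey = region_sum(img, top, left + w, h, w)
    return grey - white

# Toy 2x4 image: left half is 1s, right half is 2s.
img = [[1, 1, 2, 2],
       [1, 1, 2, 2]]
value = two_rect_feature(img, 0, 0, 2, 2)  # grey sum 8 minus white sum 4
```

Computed this way each feature costs time proportional to its area; Section 2.1 replaces these per-pixel sums with constant-time lookups.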
Given that the base resolution of the detector is 24x24, the exhaustive set of rectangle features is quite large, over 180,000. Note that unlike the Haar basis, the set of rectangle features is overcomplete.
2.1. Integral Image
Rectangle features can be computed very rapidly using an intermediate representation for the image which we call the integral image. The integral image at location (x, y) contains the sum of the pixels above and to the left of (x, y), inclusive:

ii(x, y) = Σ_{x' ≤ x, y' ≤ y} i(x', y')
Figure 2: The sum of the pixels within rectangle D can be computed with four array references. The value of the integral image at location 1 is the sum of the pixels in rectangle A. The value at location 2 is A + B, at location 3 is A + C, and at location 4 is A + B + C + D. The sum within D can be computed as 4 + 1 − (2 + 3).
where ii(x, y) is the integral image and i(x, y) is the original image. Using the following pair of recurrences:

s(x, y) = s(x, y − 1) + i(x, y)    (1)
ii(x, y) = ii(x − 1, y) + s(x, y)    (2)

(where s(x, y) is the cumulative row sum, s(x, −1) = 0, and ii(−1, y) = 0) the integral image can be computed in one pass over the original image.
Using the integral image any rectangular sum can be computed in four array references (see Figure 2). Clearly the difference between two rectangular sums can be computed in eight references. Since the two-rectangle features defined above involve adjacent rectangular sums they can be computed in six array references, eight in the case of the three-rectangle features, and nine for the four-rectangle features.
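The one-pass recurrences and the four-reference rectangle sum can be sketched as follows. This is a minimal illustration; the function names and the [row, column] indexing are our choices:

```python
import numpy as np

def integral_image(img):
    """One pass over the image, maintaining the cumulative row sum s:
    s(x, y) = s(x, y-1) + i(x, y) and ii(x, y) = ii(x-1, y) + s(x, y)."""
    img = np.asarray(img, dtype=np.int64)
    ii = np.zeros_like(img)
    s = np.zeros_like(img)
    h, w = img.shape
    for y in range(h):
        for x in range(w):
            s[y, x] = (s[y - 1, x] if y > 0 else 0) + img[y, x]
            ii[y, x] = (ii[y, x - 1] if x > 0 else 0) + s[y, x]
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of pixels in a rectangle using exactly four array references,
    as in Figure 2: 4 + 1 - (2 + 3). Out-of-range references count as 0."""
    def at(y, x):
        return ii[y, x] if y >= 0 and x >= 0 else 0
    bottom, right = top + height - 1, left + width - 1
    return (at(bottom, right) - at(top - 1, right)
            - at(bottom, left - 1) + at(top - 1, left - 1))
```

A two-rectangle feature is then the difference of two such `rect_sum` calls, which share two corners, giving the six-reference cost noted above.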
2.2. Feature Discussion
Rectangle features are somewhat primitive when compared with alternatives such as steerable filters [5, 7]. Steerable filters, and their relatives, are excellent for the detailed analysis of boundaries, image compression, and texture analysis. In contrast rectangle features, while sensitive to the presence of edges, bars, and other simple image structure, are quite coarse. Unlike steerable filters the only orientations available are vertical, horizontal, and diagonal. The set of rectangle features does however provide a rich image representation which supports effective learning. In conjunction with the integral image, the efficiency of the rectangle feature set provides ample compensation for their limited flexibility.
3. Learning Classification Functions

Given a feature set and a training set of positive and negative images, any number of machine learning approaches could be used to learn a classification function. In our system a variant of AdaBoost is used both to select a small set of features and train the classifier [6]. In its original form, the AdaBoost learning algorithm is used to boost the classification performance of a simple (sometimes called weak) learning algorithm. There are a number of formal guarantees provided by the AdaBoost learning procedure. Freund and Schapire proved that the training error of the strong classifier approaches zero exponentially in the number of rounds. More importantly a number of results were later proved about generalization performance [14]. The key insight is that generalization performance is related to the margin of the examples, and that AdaBoost achieves large margins rapidly.
Recall that there are over 180,000 rectangle features associated with each image sub-window, a number far larger than the number of pixels. Even though each feature can be computed very efficiently, computing the complete set is prohibitively expensive. Our hypothesis, which is borne out by experiment, is that a very small number of these features can be combined to form an effective classifier. The main challenge is to find these features.
In support of this goal, the weak learning algorithm is designed to select the single rectangle feature which best separates the positive and negative examples (this is similar to the approach of [2] in the domain of image database retrieval). For each feature, the weak learner determines the optimal threshold classification function, such that the minimum number of examples are misclassified. A weak classifier h_j(x) thus consists of a feature f_j, a threshold θ_j and a parity p_j indicating the direction of the inequality sign:

h_j(x) = 1 if p_j f_j(x) < p_j θ_j, and 0 otherwise.
Here x is a 24x24 pixel sub-window of an image. See Table 1 for a summary of the boosting process.
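A sketch of this thresholded weak classifier, together with a brute-force search for the optimal threshold and parity, might look like the following. The helper names are ours, and a practical implementation would search thresholds more efficiently than this exhaustive loop:

```python
def weak_classify(feature_value, theta, parity):
    """h(x) = 1 if parity * f(x) < parity * theta, else 0.
    parity in {+1, -1} sets the direction of the inequality."""
    return 1 if parity * feature_value < parity * theta else 0

def best_threshold(values, labels):
    """Exhaustively try candidate thresholds and both parities, returning the
    (theta, parity, errors) triple that misclassifies the fewest examples."""
    best = (None, None, len(values) + 1)
    candidates = sorted(set(values)) + [max(values) + 1]
    for theta in candidates:
        for parity in (+1, -1):
            errors = sum(weak_classify(v, theta, parity) != y
                         for v, y in zip(values, labels))
            if errors < best[2]:
                best = (theta, parity, errors)
    return best

# Toy data: negatives have small feature values, positives large.
theta, parity, errors = best_threshold([1, 2, 8, 9], [0, 0, 1, 1])
```

On this toy data the search finds a perfect separator (parity −1, i.e. "predict positive when the feature value exceeds the threshold").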
In practice no single feature can perform the classification task with low error. Features which are selected in early rounds of the boosting process had error rates between 0.1 and 0.3. Features selected in later rounds, as the task becomes more difficult, yield error rates between 0.4 and 0.5.
3.1. Learning Discussion
Many general feature selection procedures have been proposed (see chapter 8 of [18] for a review). Our final application demanded a very aggressive approach which would discard the vast majority of features. For a similar recognition problem Papageorgiou et al. proposed a scheme for feature selection based on feature variance [10]. They demonstrated good results selecting 37 features out of a total of 1734 features.
Roth et al. propose a feature selection process based on the Winnow exponential perceptron learning rule [11]. The Winnow learning process converges to a solution where many of the weights are zero. Nevertheless a very large
Given example images (x_1, y_1), ..., (x_n, y_n) where y_i = 0, 1 for negative and positive examples respectively.

Initialize weights w_{1,i} = 1/(2m), 1/(2l) for y_i = 0, 1 respectively, where m and l are the number of negatives and positives respectively.

For t = 1, ..., T:

1. Normalize the weights, w_{t,i} ← w_{t,i} / Σ_j w_{t,j}, so that w_t is a probability distribution.

2. For each feature j, train a classifier h_j which is restricted to using a single feature. The error is evaluated with respect to w_t: ε_j = Σ_i w_i |h_j(x_i) − y_i|.

3. Choose the classifier h_t with the lowest error ε_t.

4. Update the weights: w_{t+1,i} = w_{t,i} β_t^{1−e_i}, where e_i = 0 if example x_i is classified correctly, e_i = 1 otherwise, and β_t = ε_t / (1 − ε_t).

The final strong classifier is:

h(x) = 1 if Σ_{t=1}^T α_t h_t(x) ≥ (1/2) Σ_{t=1}^T α_t, and 0 otherwise, where α_t = log(1/β_t).

Table 1: The AdaBoost algorithm for classifier learning. Each round of boosting selects one feature from the 180,000 potential features.
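Table 1 can be rendered as a small, illustrative training loop. This is a toy sketch (the exhaustive threshold search, the guards around zero error, and all helper names are ours), not the paper's optimized implementation:

```python
import math

def train_adaboost(feature_matrix, labels, rounds):
    """feature_matrix[i][j]: value of feature j on example i; labels in {0, 1}.
    Each round selects the single feature/threshold/parity with lowest weighted error."""
    m = labels.count(0)  # number of negatives
    l = labels.count(1)  # number of positives
    w = [1.0 / (2 * m) if y == 0 else 1.0 / (2 * l) for y in labels]
    chosen = []  # list of (feature index, theta, parity, alpha)
    for _ in range(rounds):
        total = sum(w)
        w = [wi / total for wi in w]  # step 1: normalize the weights
        best = None
        for j in range(len(feature_matrix[0])):
            vals = [row[j] for row in feature_matrix]
            for theta in set(vals):
                for parity in (+1, -1):
                    preds = [1 if parity * v < parity * theta else 0 for v in vals]
                    err = sum(wi * abs(p - y) for wi, p, y in zip(w, preds, labels))
                    if best is None or err < best[0]:  # steps 2-3: lowest weighted error
                        best = (err, j, theta, parity, preds)
        err, j, theta, parity, preds = best
        beta = err / (1 - err) if err < 1 else 1.0
        # step 4: down-weight correctly classified examples by beta
        w = [wi * (beta if p == y else 1.0) for wi, p, y in zip(w, preds, labels)]
        alpha = math.log(1 / beta) if beta > 0 else 1e9  # guard against zero error
        chosen.append((j, theta, parity, alpha))
    return chosen

def strong_classify(chosen, features):
    """Weighted majority vote of the selected weak classifiers."""
    score = sum(a for (j, t, p, a) in chosen if p * features[j] < p * t)
    return 1 if score >= 0.5 * sum(a for (_, _, _, a) in chosen) else 0
```

On separable toy data a single round suffices; the per-round cost is dominated by the weak-classifier search over all features, which is why feature selection and classification share one loop.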
number of features are retained(perhaps a few hundred or thousand).
3.2. Learning Results
While details on the training and performance of the final system are presented in Section 5, several simple results merit discussion. Initial experiments demonstrated that a frontal face classifier constructed from 200 features yields a detection rate of 95% with a false positive rate of 1 in 14084. These results are compelling, but not sufficient for many real-world tasks. In terms of computation, this classifier is probably faster than any other published system, requiring 0.7 seconds to scan a 384 by 288 pixel image. Unfortunately, the most straightforward technique for improving detection performance, adding features to the classifier, directly increases computation time.
For the task of face detection, the initial rectangle features selected by AdaBoost are meaningful and easily interpreted. The first feature selected seems to focus on the property that the region of the eyes is often darker than the region of the nose and cheeks (see Figure 3). This feature is relatively large in comparison with the detection sub-window, and should be somewhat insensitive to the size and location of the face. The second feature selected relies on the property that the eyes are darker than the bridge of the nose.

Figure 3: The first and second features selected by AdaBoost. The two features are shown in the top row and then overlayed on a typical training face in the bottom row. The first feature measures the difference in intensity between the region of the eyes and a region across the upper cheeks. The feature capitalizes on the observation that the eye region is often darker than the cheeks. The second feature compares the intensities in the eye regions to the intensity across the bridge of the nose.
4. The Attentional Cascade
This section describes an algorithm for constructing a cascade of classifiers which achieves increased detection performance while radically reducing computation time. The key insight is that smaller, and therefore more efficient, boosted classifiers can be constructed which reject many of the negative sub-windows while detecting almost all positive instances (i.e. the threshold of a boosted classifier can be adjusted so that the false negative rate is close to zero). Simpler classifiers are used to reject the majority of sub-windows before more complex classifiers are called upon to achieve low false positive rates.
The overall form of the detection process is that of a degenerate decision tree, what we call a "cascade" (see Figure 4). A positive result from the first classifier triggers the evaluation of a second classifier which has also been adjusted to achieve very high detection rates. A positive result from the second classifier triggers a third classifier, and so on. A negative outcome at any point leads to the immediate rejection of the sub-window.
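This degenerate-decision-tree evaluation amounts to a short-circuiting loop over the stages. A minimal sketch, where the representation of a stage as a callable returning pass/reject is our assumption:

```python
def cascade_classify(stages, sub_window):
    """Evaluate stages in order; any rejection stops processing immediately.
    Only sub-windows that pass every stage are reported as detections."""
    for stage in stages:
        if not stage(sub_window):
            return False  # negative outcome: immediate rejection
    return True  # survived every stage

# Toy usage: each "stage" thresholds a single summary value of the window,
# with later stages stricter than earlier ones.
stages = [lambda w: w > 0, lambda w: w > 5, lambda w: w > 7]
```

The speed of the cascade comes from the early `return False`: most sub-windows are negative and exit after the first cheap stage, so the average cost per window stays far below the cost of the full classifier chain.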
Stages in the cascade are constructed by training classifiers using AdaBoost and then adjusting the threshold to minimize false negatives. Note that the default AdaBoost threshold is designed to yield a low error rate on the training data. In general a lower threshold yields higher detection rates and higher false positive rates.

Figure 4: Schematic depiction of the detection cascade. A series of classifiers is applied to every sub-window. The initial classifier eliminates a large number of negative examples with very little processing. Subsequent layers eliminate additional negatives but require additional computation. After several stages of processing the number of sub-windows has been reduced radically. Further processing can take any form such as additional stages of the cascade (as in our detection system) or an alternative detection system.
For example an excellent first stage classifier can be constructed from a two-feature strong classifier by reducing the threshold to minimize false negatives. Measured against a validation training set, the threshold can be adjusted to detect 100% of the faces with a false positive rate of 40%. See Figure 3 for a description of the two features used in this classifier.
Computation of the two feature classifier amounts to about 60 microprocessor instructions. It seems hard to imagine that any simpler filter could achieve higher rejection rates. By comparison, scanning a simple image template, or a single layer perceptron, would require at least 20 times as many operations per sub-window.
The structure of the cascade reflects the fact that within any single image an overwhelming majority of sub-windows are negative. As such, the cascade attempts to reject as many negatives as possible at the earliest stage possible. While a positive instance will trigger the evaluation of every classifier in the cascade, this is an exceedingly rare event.
Much like a decision tree, subsequent classifiers are trained using those examples which pass through all the previous stages. As a result, the second classifier faces a more difficult task than the first. The examples which make it through the first stage are "harder" than typical examples. The more difficult examples faced by deeper classifiers push the entire receiver operating characteristic (ROC) curve downward. At a given detection rate, deeper classifiers have correspondingly higher false positive rates.

4.1. Training a Cascade of Classifiers
The cascade training process involves two types of trade-offs. In most cases classifiers with more features will achieve higher detection rates and lower false positive rates. At the same time classifiers with more features require more time to compute. In principle one could define an optimization framework in which: i) the number of classifier stages, ii) the number of features in each stage, and iii) the threshold of each stage, are traded off in order to minimize the expected number of evaluated features. Unfortunately finding this optimum is a tremendously difficult problem.
In practice a very simple framework is used to produce an effective classifier which is highly efficient. Each stage in the cascade reduces the false positive rate and decreases the detection rate. A target is selected for the minimum reduction in false positives and the maximum decrease in detection. Each stage is trained by adding features until the target detection and false positive rates are met (these rates are determined by testing the detector on a validation set). Stages are added until the overall target for false positive and detection rate is met.
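The stage-wise training loop just described can be sketched as follows, with `train_stage` and `eval_rates` left as placeholders for AdaBoost training and validation-set measurement. All names, parameters, and the toy targets are ours, and the sketch assumes the per-stage targets are always eventually reachable:

```python
def train_cascade(train_stage, eval_rates, max_stage_fpr, min_stage_det, overall_fpr_target):
    """Grow stages until the overall false positive target is met.
    train_stage(stages, n_features) -> a new stage classifier.
    eval_rates(stages) -> (false positive rate, detection rate) on a validation set."""
    stages = []
    overall_fpr = 1.0
    while overall_fpr > overall_fpr_target:
        n_features = 0
        while True:
            # Add features to the current stage until its targets are met.
            n_features += 1
            stage = train_stage(stages, n_features)
            fpr, det = eval_rates(stages + [stage])
            if fpr <= max_stage_fpr * overall_fpr and det >= min_stage_det:
                break
        stages.append(stage)
        overall_fpr = fpr
    return stages
```

With stub trainers for illustration (each stage halving the false positive rate at full detection), the loop simply adds stages until the compounded rate drops below the target.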
4.2. Detector Cascade Discussion
The complete face detection cascade has 38 stages with over 6000 features. Nevertheless the cascade structure results in fast average detection times. On a difficult dataset, containing 507 faces and 75 million sub-windows, faces are detected using an average of 10 feature evaluations per sub-window. In comparison, this system is about 15 times faster than an implementation of the detection system constructed by Rowley et al. [12].
A notion similar to the cascade appears in the face detection system described by Rowley et al. in which two detection networks are used [12]. Rowley et al. used a faster yet less accurate network to prescreen the image in order to find candidate regions for a slower more accurate network. Though it is difficult to determine exactly, it appears that Rowley et al.'s two network face system is the fastest existing face detector.
The structure of the cascaded detection process is essentially that of a degenerate decision tree, and as such is related to the work of Amit and Geman [1]. Unlike techniques which use a fixed detector, Amit and Geman propose an alternative point of view where unusual co-occurrences of simple image features are used to trigger the evaluation of a more complex detection process. In this way the full detection process need not be evaluated at many of the potential image locations and scales. While this basic insight