AN IMAGE SIGNATURE FOR ANY KIND OF IMAGE
H. Chi Wong, Marshall Bern, and David Goldberg
Xerox Palo Alto Research Center
3333 Coyote Hill Rd., Palo Alto, CA 94304
ABSTRACT
We describe an algorithm for computing an image signature, suitable for first-stage screening for duplicate images. Our signature relies on relative brightness of image regions, and is generally applicable to photographs, text documents, and line art. We give experimental results on the sensitivity and robustness of signatures for actual image collections, and also results on the robustness of signatures under transformations such as resizing, rescanning, and compression.
1. BACKGROUND AND MOTIVATION

Massive image databases are becoming increasingly common. Examples include document image databases such as declassified government documents [8], and photo archives such as the New York Times archive. Duplicate removal offers space and bandwidth savings, and more user-friendly search results. Despite some effort to cull duplicates, the image search service of Google [4] often retrieves a number of duplicate and near-duplicate images. Duplicate detection also finds application in copyright protection and authentication.
We believe that duplicate detection is most effectively solved in two distinct stages. A fast first stage reduces images to small signatures, with the property that signatures of two different versions of the same image have small vector distance relative to the typical distance between signatures of distinct images. A slow second stage then makes a detailed comparison of candidate duplicates identified in the first stage. Detailed comparison of document images can identify changes as small as moving a decimal point [16].
In this paper we give a fast and simple algorithm for the first stage. Our image signature encodes the relative brightness of different regions of the image; it can be applied quite generally to text document images, line art (such as cartoons), and continuous-tone images. Although there are a number of image signature schemes already in the literature, there is no one signature that applies to such a wide class of images. The main limitation of our signature is that it is not designed to handle large amounts of cropping or rotation. This design choice is appropriate for document image databases and Web search, but not for object recognition or for detecting copyright violations.
2. PREVIOUS WORK
Image signatures have already been used to address three different, but closely related, problems: similarity search, authentication, and duplicate detection. The three problems have slightly different characteristics that affect the design of the signature. Signatures for similarity search [10] (for example, color histograms) must preserve similarity, not just identity, and hence do not necessarily spread out non-duplicates very effectively. Signatures for authentication [1, 11, 13, 15] must be more sensitive (but can be much slower) than signatures for the first stage of duplicate detection.
Nevertheless, the techniques developed for search and authentication can be harnessed for duplicate detection. Our work adapts a method by O'Gorman and Rabinovich [11] for computing a signature for ID card photos. Their method places a grid of points on the ID card photo and computes a vector of 8 bits for each point in the grid; roughly speaking, each bit records whether a neighborhood of that point is darker or lighter than a neighborhood of its 8 diagonal or orthogonal neighbors. In the usage scenario, the image signature computed from the photo is compared with a reference signature written on the ID card and digitally signed with public-key cryptography.
Another technique worth considering lets the image content dictate the neighborhoods used in the signature [1, 14]. Schmid and Mohr [14] use a corner-point detector to define "interesting points" in the image. The signature then includes a statistical abstract of the neighborhood of each interesting point. Compared to grid-based methods such as ours, this approach has the advantage that it can handle large amounts of cropping, and if the statistics at each interesting point are rotation invariant, it can also handle arbitrary amounts of rotation. On the other hand, it seems to take several hundred interesting points to obtain a reliable signature, and this approach is likely to break down entirely for text documents or line art images, which may contain thousands of interesting points with very similar statistics.
Also available in the literature are specialized signature methods for document images, using optical character recognition [5, 8] or document-specific features such as paragraph layout, line lengths, and letter shapes [2, 3]. We expect that these specialized methods are more accurate than our own generic method. Our method, however, should perform quite well for first-stage filtering, reducing the number of possible duplicate pairs by a factor of hundreds or thousands, whereupon still more accurate and detailed full-image matching algorithms [9, 12, 16] can be used for the slow second-stage matching.
Fig. 1. "Naturally occurring" duplicates: two images of the Mona Lisa downloaded from the Web.
3. THE SIGNATURE
Ideally, we would like a signature that is small enough to allow efficient nearest-neighbor search, sensitive enough to effectively filter a database for possible duplicates, and yet robust enough to find duplicates that have been resized, rescanned, or lossily compressed. Figure 1 shows black-and-white versions of a typical duplicate pair. These are black-and-white versions of two different JPEG-compressed color images of the Mona Lisa downloaded from the Web. The images have different sizes, gray values, and cropping (look at the fingers).
We now describe the steps of our algorithm in detail. Although inspired by the signature algorithm of O'Gorman et al. [11], our algorithm differs at almost every step. For example, we added a cropping step to handle mild cropping, and expanded O'Gorman's two-level relative darkness values (simply "darker" or "lighter") to five levels to give greater sensitivity and robustness.
Step 1. If the image is color, we first convert it to 8-bit gray-scale using the standard color-conversion algorithms included in djpeg and ppmtopgm. Pure white is represented by 255 and pure black by 0.
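As a small illustration of Step 1 (the paper used djpeg and ppmtopgm; Pillow is substituted here purely as an assumption of this sketch):

import numpy as np
from PIL import Image

def to_grayscale(path):
    # Load an image and convert it to 8-bit gray-scale using Pillow's standard
    # luminance conversion (white = 255, black = 0), standing in for djpeg/ppmtopgm.
    return np.asarray(Image.open(path).convert("L"), dtype=np.uint8)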
Step 2. We next impose a 9 x 9 grid of points on the image. (For large databases, a bigger grid would give greater first-stage filtering.) We define the grid in a way that is robust to mild cropping, under the assumption that such cropping usually removes relatively featureless parts of the image, for example, the margins of a document image or the dark bottom of the Mona Lisa picture. For each column of the image, we compute the sum of absolute values of differences between adjacent pixels in that column. We compute the total over all columns, and crop the image at the 5% and 95% columns, that is, the columns such that 5% of the total sum of differences lies on either side of the cropped image. We crop the rows of the image the same way (using the sums of the original uncropped rows).
Conceptually, we then divide the cropped image into a 10 x 10 grid of blocks. We round each interior grid point to the closest pixel (that is, integer coordinates), thereby setting a 9 x 9 grid of points on the image.
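The following sketch, in Python with numpy, illustrates Step 2 under the assumptions stated above (a 9 x 9 grid obtained as the interior points of a 10 x 10 subdivision of the cropped image); the helper names are ours, not the paper's:

import numpy as np

def crop_bounds(diff_sums, lo=0.05, hi=0.95):
    # diff_sums[i]: sum of absolute adjacent-pixel differences in column (or row) i.
    c = np.cumsum(diff_sums) / max(diff_sums.sum(), 1)
    return int(np.searchsorted(c, lo)), int(np.searchsorted(c, hi))

def grid_points(gray, n=9):
    # gray: 2-D array of gray levels.  Crop columns and rows at the 5% / 95%
    # points of the difference-sum distribution (both computed on the original
    # image), then place an n x n grid of interior points, rounded to the
    # nearest pixel, on the cropped image.
    col_diffs = np.abs(np.diff(gray.astype(int), axis=0)).sum(axis=0)  # per column
    row_diffs = np.abs(np.diff(gray.astype(int), axis=1)).sum(axis=1)  # per row
    x0, x1 = crop_bounds(col_diffs)
    y0, y1 = crop_bounds(row_diffs)
    # Interior points of an (n+1)-way subdivision of the cropped image.
    xs = np.rint(np.linspace(x0, x1, n + 2))[1:-1].astype(int)
    ys = np.rint(np.linspace(y0, y1, n + 2))[1:-1].astype(int)
    return ys, xs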
Step 3. At each grid point, we compute the average gray level of the P x P square centered at the grid point. We ran our experiments with P = max(2, floor(0.5 + min(n, m)/20)), where n and m are the dimensions of the image in pixels. The squares are slightly soft-edged, meaning that instead of using the pixels' gray levels themselves, we use an average of a small block centered at that pixel. Figure 2 shows the grid of squares for half-size versions of each of the images from Figure 1.
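A sketch of Step 3 under the reconstruction of P given above; the soft-edge averaging of individual pixels is omitted here for brevity:

import numpy as np

def square_averages(gray, ys, xs):
    # Average gray level of the P x P square centered at each grid point,
    # with P = max(2, floor(0.5 + min(n, m)/20)) taken from the image size.
    P = max(2, int(0.5 + min(gray.shape) / 20))
    h = P // 2
    avgs = np.zeros((len(ys), len(xs)))
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            y0, y1 = max(0, y - h), min(gray.shape[0], y - h + P)
            x0, x1 = max(0, x - h), min(gray.shape[1], x - h + P)
            avgs[i, j] = gray[y0:y1, x0:x1].mean()
    return avgs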
Step 4. For each grid point, we compute an 8-element array whose elements give a comparison of the average gray level of the grid point's square with those of its eight neighbors. The result of a comparison can be "much darker", "darker", "same", "lighter", or "much lighter", represented numerically as -2, -1, 0, 1, and 2, respectively. The "same" values are those averages that differ by no more than 2 on a scale of 0 to 255. We set the boundary between "much darker" and "darker" so that the two values are equally popular; we do the same for "lighter" and "much lighter". The rationale in this step is that "same" may be very common in images with flat backgrounds (such as text documents), and hence it should not be included in the histogram equalization applied to the other values. Grid points in the first or last rows or columns have fewer than 8 neighbors; for simplicity we fill in the missing comparisons (there are 104 of them) in the array with zeros.
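To make Step 4 concrete, here is a minimal sketch continuing the assumptions of the earlier sketches: it forms the eight neighbor differences at each grid point and splits the non-"same" differences at their medians so that "darker"/"much darker" (and "lighter"/"much lighter") are equally popular. The sign convention (positive meaning the neighbor is lighter) is our choice for the sketch.

import numpy as np

def neighbor_comparisons(avgs, same_tol=2.0):
    # avgs: 2-D array of average gray levels at the grid points (e.g. 9 x 9).
    # Returns, for each grid point, 8 values in {-2, -1, 0, 1, 2} comparing the
    # point's square with its 8 neighbors; missing neighbors are left at 0.
    n, m = avgs.shape
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    diffs = np.zeros((n, m, 8))
    for i in range(n):
        for j in range(m):
            for k, (di, dj) in enumerate(offsets):
                if 0 <= i + di < n and 0 <= j + dj < m:
                    diffs[i, j, k] = avgs[i + di, j + dj] - avgs[i, j]
    out = np.zeros_like(diffs, dtype=np.int8)
    dark = diffs < -same_tol      # beyond the "same" band of +/- 2 gray levels
    light = diffs > same_tol
    out[dark] = -1
    out[light] = 1
    # Equal-popularity boundaries: split each side at its median difference.
    if dark.any():
        out[dark & (diffs < np.median(diffs[dark]))] = -2
    if light.any():
        out[light & (diffs > np.median(diffs[light]))] = 2
    return out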
Step 5. The signature of an image is simply the concatenation of the 8-element arrays corresponding to the grid points, ordered left-to-right, top-to-bottom. Our signatures are thus vectors of length 648. We store them in 648-byte arrays, but because some of the entries for the first and last rows and columns are known to be zeros and because each byte is used to hold only 5 values, signatures could be represented by as few as roughly 1263 bits.
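For concreteness, the counts above follow from numbers already given (a 9 x 9 grid, 8 comparisons per point, 104 comparisons fixed at zero, and 5 possible values per comparison):

$$ 9 \times 9 \times 8 = 648, \qquad (648 - 104)\,\log_2 5 = 544 \cdot \log_2 5 \approx 1263 \text{ bits}. $$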
Fig. 2. The grid squares for the Mona Lisa images.
4. NEAREST NEIGHBORS
The normalized L2 distance between two image signatures u and v is ||u - v|| / (||u|| + ||v||), where || . || represents the L2 norm of a vector, that is, its Euclidean length. Relative to the L1 norm (or Hamming distance, as used in [11]), the L2 norm gives greater emphasis to larger differences; this is appropriate because a difference of 1 at a coordinate may be due to roundoff, but a difference of 2 is meaningful.

It is necessary to normalize the distance, rather than just using ||u - v||, in order to compare the distance against a fixed threshold. A suitably chosen fixed threshold gives reasonable robustness to image transformations without admitting very many false matches. For text documents, a slightly lower threshold is better, because their signatures usually have many more zeros than do the signatures of continuous-tone images. An easy fix, which allows the use of a single threshold for any type of image, is to change the basic arithmetic: count the difference between a 0 and a 2 (or -2) in ||u - v|| as 3. We used this fix in the experiments reported below.
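A minimal sketch of this distance, assuming signatures stored as numpy int8 vectors; the modified arithmetic for 0-versus-(+/-)2 differences is applied to the difference vector before taking its norm:

import numpy as np

def normalized_distance(u, v):
    # || u - v || / (||u|| + ||v||), with a difference between a 0 and a 2
    # (or -2) counted as 3 instead of 2.
    u = u.astype(float)
    v = v.astype(float)
    d = np.abs(u - v)
    d[(d == 2) & ((u == 0) | (v == 0))] = 3
    denom = np.linalg.norm(u) + np.linalg.norm(v)
    return np.linalg.norm(d) / denom if denom else 0.0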
Given a query signature q, we now need a way to find all signatures s in a database within the normalized distance threshold of q. The usual way to accomplish this task involves chopping up the signatures into (possibly non-contiguous and overlapping) words of k bytes, and finding all s's that exactly match q on some word. For each of the word positions, an index table points to all the signatures with each given word. The parameters (the number of words and the word length k) are chosen so that any s within the normalized distance threshold of q is sure to match q on at least one word, and so that the space requirement of the index tables is not too large. This method solves the problem of nearest-neighbor search in high dimensions by finding exact matches in a number of lower-dimensional projections. There are also more sophisticated algorithms and analyses of nearest-neighbor search in high dimensions [6, 7].

In our case, a signature within the normalized distance threshold may still differ from the query at every byte, so there is no good choice of word length k. Therefore we propose that in the index tables we lump together -1 and -2, that is, represent them both by -1, and similarly lump together 1 and 2. The table entries, although indexed by words over 3 possible letters, still point to the more precise 5-letter signatures. With a reasonable choice of the parameters, a signature within the threshold of the query is very likely to match it on some word: if each letter of the lumped version of s has some probability p of matching the corresponding letter in the lumped version of q, then each word matches with probability p^k, and the chance that at least one of 100 words matches works out to be greater than .993. A random signature s, however, is unlikely to match q on any word: each letter of lumped s matches the corresponding letter of lumped q with a much smaller probability, each word matches with a correspondingly smaller probability, and the chance that at least one of the 81 words matches is small. Hence only about 10% of the database signatures need to have their actual normalized distances to q computed. Because the computation of normalized distances is very fast, this should be quite tolerable for a million-image database. Larger values of k would give greater discriminating power but use more memory space. As k grows larger, signatures within the threshold would tend to have many more word matches than random signatures.
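The sketch below illustrates the lumped word index. The number of words and the word length are not fixed by the text above, so the values used here (100 words of 10 letters at evenly spaced starting positions) are illustrative assumptions only:

import numpy as np
from collections import defaultdict

N_WORDS, WORD_LEN = 100, 10          # illustrative values, not the paper's

def words(sig):
    # Lump -2 with -1 and 2 with 1, then cut the signature (a length-648
    # int8 vector) into N_WORDS overlapping words of WORD_LEN letters each.
    lumped = np.clip(sig, -1, 1).astype(np.int8) + 1   # letters now in {0, 1, 2}
    starts = np.linspace(0, len(sig) - WORD_LEN, N_WORDS).astype(int)
    return [(int(s), lumped[s:s + WORD_LEN].tobytes()) for s in starts]

def build_index(signatures):
    # index[(position, word)] -> ids of all signatures containing that word.
    index = defaultdict(list)
    for sid, sig in enumerate(signatures):
        for key in words(sig):
            index[key].append(sid)
    return index

def candidates(index, query_sig):
    # Signatures that match the query exactly on at least one word; only these
    # need their actual normalized distances to the query computed.
    hits = set()
    for key in words(query_sig):
        hits.update(index.get(key, ()))
    return hits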
Notice that with this method for nearest-neighbor searching, the overall image duplicate detection process becomes a three-level scheme. The indexing on words finds candidate signature pairs, which in turn lead to candidate image pairs.
5. RESULTS
We tested the robustness (the "recall") of our algorithm on both synthetic and natural data. The synthetic data consist of a set of original images (600 photo CD images, belonging to 6 themes such as Michelangelo, transportation, and cityscape, and 170 document images, pages from a Ph.D. thesis) and their synthetic duplicates, generated by rotating, resizing, and compressing/decompressing the originals. The bar charts below show the robustness of our algorithm in finding synthetic duplicates. The natural data consist of 10 sets of images downloaded from the Web, each containing a number of visually (close-to-)identical images of different sizes, with different amounts of cropping, and some even with slight color differences. Our algorithm handles the naturally occurring duplicates well, failing only in cases where there is a significant amount of cropping or color difference.
[Bar charts (not reproduced): percentage of synthetic duplicates found, as a function of rotation (in degrees) and resizing factor, with separate results for document images.]
We also tested the sensitivity (the "precision") of our algorithm by running it on our original set of photo CD images and document images. The table below shows the average number of false positives returned for each type of image.

image type          average number of false positives
photo CD images
document images     4.05 (in a total of 170)

As future work, we plan to conduct studies comparing our method with other existing methods.
Acknowledgments. We would like to thank Paul Stewart and Russel Atkinson for their help with software tools.
6. REFERENCES
[1] S. Bhattacharjee and M. Kutter. Compression tolerant image authentication. Proc. ICIP, Vol. 1, pp. 435–439, 1998.
[2] V. Chalana, A. G. Bruce, and T. Nguyen. Duplicate document detection in DocBrowse. Proc. SPIE Document Recognition and Retrieval Conference, 1999.
[3] D. Doermann, H. Li, and O. Kia. The detection of duplicates in document image databases. Proc. Int. Conf. Document Analysis and Recognition, pp. 314–318, 1997. documents.cfar.umd.edu/LAMP/
[4] Google Advanced Image Search. google.com
[5] J. Hull. Document image similarity and equivalence detection. Pattern Recognition Letters, 2001, 545–550.
[6] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. ACM Symp. Theory of Computing, 1999.
[7] P. Indyk, R. Motwani, P. Raghavan, and S. Vempala. Locality-preserving hashing in multidimensional spaces. ACM Symp. Theory of Computing, 1997.
[8] D. Lopresti. A comparison of text-based methods for detecting duplication in document image databases. Proc. Document Recognition and Retrieval VI, SPIE Electronic Imaging, Vol. 3967, pp. 210–221, 2000. /org/1133/Research/DocumentAnalysis/dup.html
[9] J. Mao and K. Mohiuddin. Form dropout using distance transformation. Proc. ICIP, Vol. 3, pp. 328–331, 1995.
[10] W. Niblack, R. Barber, and W. Equitz. The QBIC project: querying images by content using color, texture, and shape. IBM Research Report RJ 9203, February 1993.
[11] L. O'Gorman and I. Rabinovich. Secure identification documents via pattern recognition and public-key cryptography. IEEE Trans. Pattern Analysis and Machine Intelligence 20 (1998), 1097–1102.
[12] H. Peng, F. Long, W.-C. Siu, Z. Chi, and D. Feng. Document image matching based on component blocks. Proc. ICIP, Vol. 2, pp. 601–604, 2000.
[13] M. Ruhl, M. Bern, and D. Goldberg. Secure notarization of paper text documents. Proc. 12th SIAM Symp. Disc. Algorithms, 2001.
[14] C. Schmid and R. Mohr. Image retrieval using local characterization. Proc. ICIP, pp. 781–784, 1996.
[15] R. Venkatesan, S.-M. Koon, M. H. Jakubowski, and P. Moulin. Robust image hashing. Proc. ICIP, 2000.
[16] M. Ye, M. Bern, and D. Goldberg. Document image matching and annotation lifting. Proc. Int. Conf. Document Analysis and Recognition, 2001.