Electronic Marking and Identification Techniques to Discourage Document Copying
Jack T. Brassil, Senior Member, IEEE, Steven Low, Member, IEEE, Nicholas F. Maxemchuk, Fellow, IEEE, and Lawrence O’Gorman, Senior Member, IEEE
Abstract: Modern computer networks make it possible to distribute documents quickly and economically by electronic means rather than by conventional paper means. However, the widespread adoption of electronic distribution of copyrighted material is currently impeded by the ease of unauthorized copying and dissemination. In this paper we propose techniques that discourage unauthorized distribution by embedding each document with a unique codeword. Our encoding techniques are indiscernible by readers, yet enable us to identify the sanctioned recipient of a document by examination of a recovered document. We propose three coding methods, describe one in detail, and present experimental results showing that our identification techniques are highly reliable, even after documents have been photocopied.
I. INTRODUCTION
Electronic distribution of publications is increasingly available through on-line text databases, CD-ROM’s, computer network based retrieval services, and electronic libraries [1]–[6]. One electronic library, the RightPages¹ Service [7]–[9], has been in place within Bell Laboratories since 1991, and has recently been installed at the University of California in San Francisco. Electronic publishing is being driven by the decreasing cost of computer processing and high quality printers and displays. Furthermore, the increased availability of low cost, high speed data communications makes it possible to distribute electronic documents to large groups quickly and inexpensively [10].
While photocopy infringements of copyright have always concerned publishers, the need for document security is much greater for electronic document distribution [11], [12]. The same advances that make electronic publishing and distribution of documents feasible also increase the threat of “bootlegged” copies. With far less effort than it takes to copy a paper document and mail it to a single person, an electronic document can be sent to a large group by electronic mail. In addition, while originals and photocopies of a paper document can look and feel different, copies of electronic documents are identical.
In order for electronic publishing to become accepted, publishers must be assured that revenues will not be lost due to theft of copyrighted materials. Widespread unauthorized document dissemination should ideally be at least as costly or difficult as obtaining the documents legitimately. Here we define “unauthorized dissemination” as distribution of documents without the knowledge of, and payment to, the publisher; this contrasts with legitimate document distribution by the publisher or the publisher’s electronic document distributor. This paper describes a means of discouraging unauthorized copying and dissemination. A document is marked in an indiscernible way by a codeword identifying the registered owner to whom the document is sent [13]. If a document copy is found that is suspected to have been disseminated without authorization, that copy can be decoded and the registered owner identified.
Manuscript received August 8, 1994; revised March 1, 1995. A preliminary version of this paper was presented at IEEE INFOCOM ’94.
The authors are with the AT&T Bell Laboratories, Murray Hill, NJ 07974 USA.
IEEE Log Number 9413489.
¹ RightPages is a trademark of AT&T.
The techniques we describe here are complementary to the security practices that can be applied to the legitimate distribution of documents. For example, a document can be encrypted prior to transmission across a computer network [14], [15]. Then even if the document file is intercepted or stolen from a database, it remains unreadable to those not possessing the decrypting key. The techniques we describe in this paper provide security after a document has been decrypted, and is thus readable to all.
In addition to discouraging unauthorized dissemination of documents distributed by computer network, our proposed encoding techniques can also make paper copies of documents traceable. In particular, the codeword embedded in each document survives plain paper copying. Hence, our techniques can also be applied to “closely held” documents, such as confidential, limited distribution correspondence. We describe this both as a potential application of the methods and an illustration of their robustness in noise.
II. DOCUMENT CODING METHODS
Document marking can be achieved by altering the text formatting, or by altering certain characteristics of textual elements (e.g., characters). The goal in the design of coding methods is to develop alterations that are reliably decodable (even in the presence of noise) yet largely indiscernible to the reader. These criteria, reliable decoding and minimum visible change, are somewhat conflicting; herein lies the challenge in designing document marking techniques.
The marking techniques we describe can be applied to either an image representation of the document or to a document format file. The document format file is a computer
0733–8716/95$04.00 © 1995 IEEE
Fig. 1. Example of line-shift coding. The second line has been shifted up by 1/300 inch.
Fig. 2. Example of word-shift coding. In (a), the top text line has added spacing before the “for,” the bottom text line has the same spacing after the “for.” In (b), the same text lines are shown again without the vertical lines to demonstrate that either spacing appears natural.
file describing the document content and page layout (or formatting), using standard format description languages such as PostScript,² TeX, troff, etc. It is from this format file that the image (what the reader sees) is generated. The image representation describes each page (or subpage) of a document as an array of pixels. The image may be bitmap (also called binary or black-and-white), gray-scale, or color. For this work, we describe both document format file and image coding techniques; however, we restrict the latter to bitmaps encoded within the binary-valued text regions.
Common to each technique is that a codeword is embedded in the document by altering particular textual features. For instance, consider the codeword 1101 (binary). Reading this code right to left from the least significant bit, the first document feature is altered for bit 1, the second feature is not altered for bit 0, and the next two features are altered for the two 1 bits. It is the type of feature that distinguishes each particular encoding method. We describe the features for each method below and give a simple comparison of the relative advantages and disadvantages of each technique. The three coding techniques that we propose illustrate different approaches rather than form an exhaustive list of document marking techniques. The techniques can be used either separately or jointly. Each technique enjoys certain advantages or applicability as we discuss below.
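The bit ordering in the 1101 example can be made concrete with a short sketch. The function name `features_to_alter` and the integer representation of the codeword are illustrative assumptions, not part of the original encoder.

```python
from typing import List


def features_to_alter(codeword: int, n_features: int) -> List[bool]:
    """Return, in document order, whether each feature is altered.

    Bit 0 (the least significant bit) controls the first feature,
    bit 1 the second feature, and so on.
    """
    return [bool((codeword >> k) & 1) for k in range(n_features)]


# Codeword 1101 (binary): features 1, 3, and 4 are altered; feature 2 is not.
print(features_to_alter(0b1101, 4))  # [True, False, True, True]
```

The same mapping applies whichever feature type (line position, word position, or character endline) carries the bits.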
² PostScript is a trademark of Adobe Systems, Inc.
A. Line-Shift Coding
This is a method of altering a document by vertically shifting the locations of text lines to encode the document uniquely. This encoding may be applied either to the format file or to the bitmap of a page image. The embedded codeword may be extracted from the format file or bitmap. In certain cases this decoding can be accomplished without need of the original image, since the original is known to have uniform line spacing (i.e., “leading”) between adjacent lines within a paragraph.
B. Word-Shift Coding
This is a method of altering a document by horizontally shifting the locations of words within text lines to encode the document uniquely. This encoding can be applied to either the format file or to the bitmap of a page image. Decoding may be performed from the format file or bitmap. The method is least visible when applied to documents with variable spacing between adjacent words. Variable spacing in text documents is commonly used to distribute white space when justifying text. Because of this variable spacing, decoding requires the original image, or more specifically, the spacing between words in the unencoded document. See Fig. 2 for an example of word-shift coding.
Consider the following example of how a document might be encoded with word-shifting. For each text line, the largest
BRASSIL et al.: ELECTRONIC MARKING AND IDENTIFICATION TECHNIQUES TO DISCOURAGE DOCUMENT COPYING
Fig. 3. Example shows feature coding performed on a portion of text from a journal table of contents. In (a), no coding has been applied. In (b), feature coding has been applied to select characters. In (c), the feature coding has been exaggerated to show feature alterations.
and smallest spacings between words are found. To code a line, the largest spacing is decremented by some amount and the smallest is augmented by the same amount. This maintains the text line length, and produces little qualitative change to the text image.
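The word-shifting example above can be sketched in a few lines. This is a minimal illustration only: the function name `word_shift_line`, the list-of-gaps representation, and the shift amount `delta` are assumptions made for the sketch, and the text does not specify how codeword bits select which lines are coded.

```python
from typing import List


def word_shift_line(spacings: List[float], delta: float) -> List[float]:
    """Shrink the largest inter-word gap and widen the smallest by delta.

    The sum of the gaps, and therefore the text line length, is unchanged.
    """
    if len(spacings) < 2:
        return list(spacings)  # a single gap leaves nothing to trade
    out = list(spacings)
    i_max = max(range(len(out)), key=out.__getitem__)
    i_min = min(range(len(out)), key=out.__getitem__)
    out[i_max] -= delta
    out[i_min] += delta
    return out


gaps = [10.0, 14.0, 9.0, 12.0]      # hypothetical gaps, in pixels
coded = word_shift_line(gaps, 1.0)
print(coded)                         # [10.0, 13.0, 10.0, 12.0]
print(sum(gaps) == sum(coded))       # True
```

Because justified text has no fixed gap width, a decoder would need the original gap list to tell which gaps were changed, which is why decoding requires the unencoded spacings.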
C. Feature Coding
This is a coding method that is applied either to a format file or to a bitmap image of a document. The image is examined for chosen text features, and those features are altered, or not altered, depending on the codeword. Decoding requires the original image, or more specifically, a specification of the change in pixels at a feature. There are many possible choices of text features; here, we choose to alter upward, vertical endlines, that is, the tops of letters b, d, h, etc. The endlines are altered by extending or shortening their lengths by one (or more) pixels, but otherwise not changing the endline feature. See Fig. 3 for an example of feature coding.
Among the proposed encoding techniques, line-shifting is likely to be the most easily discernible by readers. However, we also expect line-shifting to be the most robust type of encoding in the presence of noise. This is because the long lengths of text lines provide a relatively easily detectable feature. For this reason, line shifting is particularly well suited to marking documents to be distributed in paper form, where noise can be introduced in printing and photocopying. As we will show in Section III, our experiments indicate that we can easily encode documents with line shifts that are sufficiently small that they are not noticed by the casual reader, while still retaining the ability to decode reliably.
We expect that word-shifting will be less discernible to the reader than line-shifting, since the spacing between adjacent words on a line is often varied to support text justification. Feature encoding can accommodate a particularly large number of sanctioned document recipients, since there are frequently two or more features available for encoding in each word. Feature alterations are also largely indiscernible to readers. Feature encoding also has the additional advantage that it can be applied simply to image files, which allows encoding to be introduced in the absence of a format file.
Implementing any of the three document marking techniques described above incurs certain “costs” for the electronic document distributor. While the exact nature of the costs is implementation dependent, we can nonetheless make several general remarks based on our experience [16]. Distributors must incur a small penalty in maintaining a library of “codebooks” which contain a mapping of embedded codewords and recipients for each original (unmarked) document they mark and distribute. A larger penalty is paid in distributing images rather than higher level page descriptions: roughly 3–5 times the number of bits must be transmitted to the subscriber.³
A technically sophisticated “attacker” can detect that a document has been encoded by any of the three techniques we have introduced. Such an attacker can also attempt to remove the encoding (e.g., produce an unencoded document copy). Our goal in the design of encoding techniques is to make successful attacks extremely difficult or costly. We will return to a discussion of the difficulty of defeating each of our encoding techniques in Section IV.
III. IMPLEMENTATION AND EXPERIMENTAL RESULTS FOR LINE-SHIFT CODING METHOD
In this section we describe in detail the methods for coding and decoding we used for testing the line-shift coding method. Each intended document recipient was preassigned a unique codeword. Each codeword specified a set of text lines to be moved in the document specifically for that recipient. The length of each codeword equaled the maximum number of lines that were displaced in the area to be encoded. In our line-shift encoder, each codeword element belonged to the alphabet
{
IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 13, NO. 8, OCTOBER 1995
Fig. 4. Profile of a recovered document page. Decoding a page with line shifting requires measuring the distances between adjacent text line centroids (marked with ) or baselines (marked with +) and deciding whether white space has been added or subtracted.
performance was greatly improved by constraining the set of lines moved. In the results presented in this paper, we used a differential (or difference) encoding technique. With this coding we kept every other line of text in each paragraph unmoved, starting with the first line of each paragraph. Each line between two unmoved lines was always moved either up or down. That is, for each paragraph, the 1st, 3rd, 5th, etc. lines were unmoved, while the 2nd, 4th, etc. lines were moved. This encoding was partially motivated by image defects we will discuss later in this section. Note that the consequence of using differential encoding is that the length of each codeword is cut approximately in half. While this reduces the potential number of recipients for an encoded document, the number can still be extremely large. In each of our experiments we
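displaced at least 19 lines, which corresponds to a potential of at least $2^{19} = 524\,288$ distinct codewords.

Let $h(y)$, $y = b, b+1, \ldots, e$, be the horizontal profile of a text line that occupies scan lines $b$ to $e$. Then the text line centroid is given by

$$\bar{y} = \frac{\sum_{y=b}^{e} y\, h(y)}{\sum_{y=b}^{e} h(y)}.$$

Suppose text lines $i-1$ and $i+1$ are not shifted and text line $i$ is either shifted up or down. In the unaltered document, the distances between adjacent baselines, or baseline spacings, are the same. Let $s_{i-1}$ and $s_i$ be the distances between baselines $i-1$ and $i$, and between baselines $i$ and $i+1$, respectively, in the altered document. Then the baseline detection decision rule is:

if $s_{i-1} > s_i$: decide line $i$ shifted down;
if $s_{i-1} < s_i$: decide line $i$ shifted up;
otherwise: uncertain. (3.2)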
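A minimal sketch of baseline detection under differential encoding follows. It assumes that baseline positions are measured in scan lines increasing down the page, that the rule compares the spacing above a moved line with the spacing below it, and that equal spacings are reported as "uncertain"; the function names are hypothetical.

```python
from typing import List


def baseline_decision(s_above: float, s_below: float) -> str:
    """Compare the baseline spacing above a moved line with the one below."""
    if s_above > s_below:
        return "down"       # white space was added above the line
    if s_above < s_below:
        return "up"         # white space was removed above the line
    return "uncertain"


def decode_baselines(baselines: List[float]) -> List[str]:
    """Decode the 2nd, 4th, ... lines of a paragraph.

    baselines[k] is the baseline position of line k+1; the 1st, 3rd, ...
    lines are assumed unmoved. A moved line without an unmoved line
    below it is skipped in this sketch.
    """
    decisions = []
    for i in range(1, len(baselines) - 1, 2):
        s_above = baselines[i] - baselines[i - 1]
        s_below = baselines[i + 1] - baselines[i]
        decisions.append(baseline_decision(s_above, s_below))
    return decisions


# Nominal spacing of 50 scan lines; line 2 moved down 1, line 4 moved up 1.
print(decode_baselines([0, 51, 100, 149, 200]))  # ['down', 'up']
```

Because only the two spacings adjacent to each moved line are compared, a uniform rescaling of the page leaves the decisions of this sketch unchanged.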
Unlike baseline spacings, centroid spacings between adjacent text lines in the original unaltered document are not necessarily uniformly spaced. In centroid-based detection, the decision is based on the difference of centroid spacings in the altered and unaltered documents. More specifically, let $s_{i-1}$ and $s_i$ be the centroid spacings between lines $i-1$ and $i$, and between lines $i$ and $i+1$, respectively, in the altered document; let $t_{i-1}$ and $t_i$ be the corresponding centroid spacings in the unaltered document. Then the centroid detection decision rule is:

if $s_{i-1} - t_{i-1} > s_i - t_i$: decide line $i$ shifted down;
otherwise: decide line $i$ shifted up.
every other line is moved and this information is known to the decoder, false alarms do not occur.
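Centroid-based detection can be sketched as follows. The sketch assumes that a text line profile is the count of black pixels on each scan line, that the centroid is the profile's center of mass, and that the decision compares the change in centroid spacing above the moved line with the change below it. The function names and the sample numbers are illustrative assumptions, not measured data.

```python
from typing import Sequence


def centroid(profile: Sequence[int], start: int = 0) -> float:
    """Center of mass of a text line profile.

    profile[k] is the number of black pixels on scan line start + k.
    """
    total = sum(profile)
    if total == 0:
        raise ValueError("profile contains no black pixels")
    return sum((start + k) * h for k, h in enumerate(profile)) / total


def centroid_decision(s_above: float, s_below: float,
                      t_above: float, t_below: float) -> str:
    """Decide the shift direction of a moved line.

    s_* are centroid spacings in the altered document and t_* are the
    corresponding spacings in the unaltered document.
    """
    if (s_above - t_above) > (s_below - t_below):
        return "down"
    return "up"


print(centroid([1, 3, 3, 1], start=10))             # 11.5
print(centroid_decision(52.0, 47.5, 50.5, 49.0))    # down
```

Unlike the baseline rule, this rule always returns a direction, which is consistent with every moved line being known to the decoder in advance.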
A. Experimental Results for Line-Shift Coding
We conducted two sets of experiments. The first set tested how well line-shift coding works with different font sizes and different line spacing shifts in the presence of limited, but typical, image noise. The second set tested how well a fixed line spacing shift could be detected as document degradation became increasingly severe. In this section, we first describe the experiments and then present our results.
The equipment we used in both experiments was as follows: a Ricoh FS1S 400 dpi Flat Bed Electronic Scanner, Apple LaserWriter IIntx 300 dpi laser printer, and a Xerox 5052 plain paper copier.⁴ The printer and copier were selected in part because they are typical of equipment found in wide use in office environments. The particular machines we used could be characterized as being heavily used but well maintained.
Writing the software routine to implement a rudimentary line-shift encoder for a PostScript input file was simple. We chose the PostScript format because: 1) it is the most common Page Description Language in use today, 2) it enables us to have sufficiently fine control of text placement, and 3) it permits us to encode documents produced by a wide variety of word processing applications. PostScript describes the document content a page at a time. Roughly speaking, it specifies the content of a text line (or text line fragment such as a phrase, word, or character) and identifies the location for the text to be displayed. Text location is specified by an x-y coordinate representing a position on a virtual page; this position can typically be altered by arbitrarily small displacements. However, most personal laser printers in common use today have a 300 dpi “resolution,” so they are unable to distinctly render text subject to a displacement of less than 1/300 inch.
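A rudimentary differential line-shift encoder of the kind described above might look like the following sketch. It is not the authors' routine: it assumes that the baseline y-coordinates of a paragraph have already been extracted from the format file, that they are expressed in PostScript default user space (72 units per inch, y increasing up the page), and that a 1 bit means "shift up" and a 0 bit means "shift down". All names are hypothetical.

```python
from typing import List

POINTS_PER_INCH = 72.0  # PostScript default user space: 72 units per inch


def shift_in_points(pixels: int, dpi: int = 300) -> float:
    """Convert a shift in printer pixels to PostScript points."""
    return pixels * POINTS_PER_INCH / dpi


def encode_paragraph(baselines_pt: List[float], bits: List[int],
                     pixels: int = 1, dpi: int = 300) -> List[float]:
    """Shift every other line of a paragraph up (bit 1) or down (bit 0).

    The 1st, 3rd, 5th, ... lines stay fixed; only lines lying between
    two fixed lines are moved.
    """
    moved = list(range(1, len(baselines_pt) - 1, 2))
    if len(bits) != len(moved):
        raise ValueError("need exactly one bit per moved line")
    delta = shift_in_points(pixels, dpi)
    out = list(baselines_pt)
    for bit, idx in zip(bits, moved):
        out[idx] += delta if bit else -delta  # y grows upward in PostScript
    return out


coded = encode_paragraph([700.0, 688.0, 676.0, 664.0, 652.0], [1, 0])
print([round(y, 2) for y in coded])  # [700.0, 688.24, 676.0, 663.76, 652.0]
```

One pixel at 300 dpi is 72/300 = 0.24 point, so the shifts applied here are the smallest displacements such a printer can render distinctly.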
1) Variable Font Size Experiment: The first set of experiments each used a single-spaced page of text in the Times-Roman font. The page was coded using the differential encoding scheme. We performed nine experiments using font sizes of 8, 10, or 12 points and shifting alternate lines (within each paragraph) up or down by 1, 2, or 3 pixels. Each page of 8, 10, and 12 point size text extended for 23, 21, and 19 lines, respectively. Different numbers of encoded lines per page arise naturally, since as the font size decreases, more lines can be placed on the page, permitting more information to be encoded. Since our printer has a 300 dpi resolution, each pixel corresponds to 1/300 inch.
th copy;the,is produced
by copying the
somewhat from copy to copy. This suggests that line spacing “information” is still present in the text baselines, and can perhaps be made available with some additional processing. We have reported the uncoded error performance of our marking scheme. But the 21 line shifts used in the experiment were not chosen arbitrarily. The 21 line shifts comprised 3 concatenated codewords selected from a
Hamming
].
Text line skew was largely removed by image rotation, at the expense of the introduction of some distortion due to bilinear interpolation of sampled data.
Blurring (i.e., edge raggedness) also increased with the number of copies produced. However, blurring seemed to have surprisingly minor implications in detection performance. It is possible that blurring introduces noise in a symmetrical fashion on text lines, so it does not contribute significantly to displacing centroid locations. Plain paper copies were produced at the copier’s nominal “copy darkness” setting; blurring typically increases with copy darkness. As the number of copies increased, copy darkness generally varied over a page; regions of severe fading were sometimes observed. It is unclear whether blurring or fading is more detrimental to decoding performance.
Expansion or shrinking of copy size is another potential problem. It is not unusual to discover a 4% page length or width change after 10 copies. Further, expansion along the length and width of a page can be markedly different. Copy size changes forced us to use differential encoding, that is, encoding information in the relative rather than absolute shifts between adjacent text lines.
C. A Noise Model
In this subsection we present a simple model of the noise affecting text line centroids. We distinguish two types of noise. The first type of noise models the distortion in printing and scanning the document; the second type models the distortion in copying. This second type of noise increases with the number of copies while the first type does not.

An unaltered page of text with $n+1$ text lines, whose centroids have vertical coordinates $y_1, \ldots, y_{n+1}$, is effectively described by the centroid spacings $t_i = y_{i+1} - y_i$, $i = 1, \ldots, n$. The $i$th line spacing shift $c_i$ is positive if extra space has been added, negative if space has been subtracted, and zero otherwise. This line spacing shift changes the original centroid spacing from $t_i$ to $t_i + c_i$.

Let $s_i^j$ be the $i$th centroid spacing in the $j$th copy, with $j = 0$ denoting the original paper copy. Let $\nu_i$ be the random noise that summarizes the cumulative effect (on the $i$th centroid spacing) of distortion introduced by printing, scanning, and image processing. We assume that the printer noise $\nu$ is strictly additive and logically distorts the centroid spacings of the original paper copy to

$$s_i^0 = t_i + c_i + \nu_i, \qquad i = 1, \ldots, n,$$

where the $\nu_i$ are independent and identically distributed Gaussian random variables. This assumption is supported by our measurements [22], which yield a mean of … and variance of ….

Let $N_i^j$ be the random noise that summarizes the cumulative effect of skewing, scaling, and other photographic distortions introduced on the $i$th centroid spacing by making the $j$th copy. Then the centroid spacings on the $j$th copy are

$$s_i^j = s_i^{j-1} + N_i^j, \qquad i = 1, \ldots, n. \qquad (4.3)$$
Hence, the centroid spacing