

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 44, NO. 5, SEPTEMBER 1998 1897

Collusion-Secure Fingerprinting for Digital Data

Dan Boneh and James Shaw

Abstract—This paper discusses methods for assigning codewords for the purpose of fingerprinting digital data, e.g., software, documents, music, and video. Fingerprinting consists of uniquely marking and registering each copy of the data. This marking allows a distributor to detect any unauthorized copy and trace it back to the user. This threat of detection will deter users from releasing unauthorized copies. A problem arises when users collude: for digital data, two different fingerprinted objects can be compared and the differences between them detected. Hence, a set of users can collude to detect the location of the fingerprint. They can then alter the fingerprint to mask their identities. We present a general fingerprinting solution which is secure in the context of collusion. In addition, we discuss methods for distributing fingerprinted data.

Index Terms—Collusion resistance, fingerprinting, marking assumption, watermarking.

I. INTRODUCTION

FINGERPRINTING is an old cryptographic technique. Several hundred years ago logarithm tables were protected by fingerprinting them. The idea was to introduce tiny errors in the insignificant digits (i.e., tenth digit right of the decimal point) of log x for a few random values of x. A different set of x's was chosen for each copy of the logarithm table. If an owner of a logarithm table ever sold illegal copies of it, the tiny errors in the table enabled the "police" to trace an illegal copy back to its owner. The owner would then be prosecuted and penalized in some way.

Nowadays no one is interested in protecting logarithm tables. However, the technique of fingerprinting is still in use. Examples include maps, diamonds, and explosives. With the increasing popularity of digital data, there is a strong desire to fingerprint these data as well. Examples of digital data to which fingerprinting may apply include documents, images, movies, music, and executables.

When fingerprinting digital data one must address the problem of collusion. For instance, suppose the logarithm table discussed above is stored as a file. Each user is given a slightly altered copy of the file. If two users get together they can easily run diff on their two files to discover all the locations where the files differ. This simple operation reveals the location of

Manuscript received July 10, 1997; revised March 22, 1998. The material in this paper was presented in part at CRYPTO'95, Santa Barbara, CA, August 1995.

D. Boneh is with the Department of Computer Science, Stanford University, Stanford, CA 94305-9045 USA.

J. Shaw was with the Department of Computer Science, Princeton University, Princeton, NJ 08544 USA. He is now at 35 Olden Street, Princeton, NJ 08544 USA.

Publisher Item Identifier S 0018-9448(98)05081-0.

the hidden marks. The users can then remove these marks and resell their logarithm table without ever worrying about being caught. Notice that two users could only detect those marks in which their copies differ. They could not detect marks where both their copies agree. We intend to use this small amount of information which the two users could not remove to trace any copy they generate back to one of them.

Throughout the paper we use the following terminology: A mark is a position in the object which can be in one of s different states. For instance, in the logarithm table example, introducing an error in the value of log x means that x is marked. If there are s error values used we say that the mark has s possible states. A fingerprint is a collection of marks. Thus a fingerprint can be thought of as a word of length l over an alphabet of size s. Here l is the number of marks embedded in the object. A distributor is the sole supplier of fingerprinted objects. A user is the registered owner of a fingerprinted object.

The process of fingerprinting an object involves assigning a unique codeword over Σ to each user. The user receives a copy of an object with the marks set according to his assigned codeword. By colluding, users can detect a specific mark if it differs between their copies; otherwise, a mark cannot be detected. The main property the marks should satisfy is that users cannot change the state of an undetected mark without rendering the object useless. We assume that marks satisfying these properties exist for the objects being fingerprinted. We refer to this as the Marking Assumption, for which a precise definition is given in the next section. Note that if there is no collusion, by the Marking Assumption, fingerprinting is trivial: the fingerprint assigned to each user will be the user's serial number.

There has been much research investigating the Marking Assumption in a variety of domains. Wagner [22] gives a taxonomy of fingerprints and suggests subtle marks for computer software. Marks have been embedded in digital video [6], [9], [10], [20], in documents [5], and in computer programs [12]. In all these domains, our scheme allows these marks to be combined to form collusion-resistant fingerprints. Thus our results are general, applying to a variety of digital data.

Previously, a weaker model of collusion-secure fingerprinting was studied in [4]. Our results are more efficient by an exponential amount both in terms of the number of users and in terms of the coalition size. The reason for this dramatic improvement is our use of randomness. We rely on randomness in two steps: one is in the construction and proof of security of our fingerprinting codes. The other is in the composition of our construction with earlier elegant results of Chor, Fiat, and Naor [8].

0018–9448/98$10.00 1998 IEEE


We note that several recent proposals enhance the functionality of fingerprinting schemes in various ways. Asymmetric fingerprinting [3], [15], [16] ensures that a corrupt distributor cannot frame an innocent user. Anonymous fingerprinting [17] makes use of a registration service to eliminate the need for the distributor to keep detailed records binding codewords to users. The combinatorial properties required for collusion-secure fingerprinting are further studied in [7] and [19].

The paper is organized as follows. In Section II, we introduce our notation and explicitly state the Marking Assumption. In Section III, we discuss a naive scenario to familiarize the reader with the notation. In Section IV, we define our notion of collusion-secure codes and in Section V, we construct such codes. In Section VI, we give a lower bound on the length of collusion-secure codes. Section VII describes a scheme for distributing fingerprinted data on bulk media such as CD-ROM's.

Throughout the paper we use the following notation. Given an l-bit word x = x_1 x_2 ⋯ x_l and a set R = {i_1, …, i_r} ⊆ {1, …, l} we denote by x|_R the word x_{i_1} x_{i_2} ⋯ x_{i_r}, where x_i is the ith letter of x. We refer to x|_R as the restriction of x to the positions in R.

II. FINGERPRINTING CODES

We begin by defining some notation. From here on Σ will denote some alphabet of size s representing the different states of the marks. The letters in Σ will be denoted by the integers 0 to s − 1.

Definition II.1: A set Γ = {w(1), w(2), …, w(n)} ⊆ Σ^l will be called an (l, n)-code. The codeword w(i) will be assigned to user u_i, for 1 ≤ i ≤ n. We refer to the set of words in Γ as the codebook.

Definition II.2: Let Γ = {w(1), …, w(n)} be an (l, n)-code and C = {u_1, …, u_c} be a coalition of users. For 1 ≤ i ≤ l we say that position i is undetectable for C if the words assigned to users in C match in their ith position. Formally, position i is undetectable if w(u_1)_i = w(u_2)_i = ⋯ = w(u_c)_i.

As was discussed in Section I, our objective is to design a collusion-secure method of assigning codewords to users. Let C be a coalition of users. We must first characterize the set of objects which the coalition can generate. Suppose the ith mark is detectable by the coalition. The coalition can generate an object in which the ith mark is in any of its s states. Furthermore, the coalition can generate an object in which the mark is in an unreadable state. When the police recovers an illegal copy of the object it cannot determine which state an unreadable mark is in. For instance, in the logarithm table example this would correspond to the coalition creating a table which does not contain the entry log x, where x is a detectable mark. We denote a mark in an unreadable state by "?." The resulting set of codewords is called the feasible set of the coalition. Formally, the feasible set is defined as follows.

Definition II.3: Let Γ = {w(1), …, w(n)} be an (l, n)-code and C = {u_1, …, u_c} be a coalition of users. Let R be the set of undetectable positions for C. Define the feasible set of C as

F(C; Γ) = { x ∈ (Σ ∪ {?})^l  s.t.  x|_R = w(u)|_R for some user u in C }.

Thus the feasible set contains all words which match the coalition's undetectable bits. Usually we omit the Γ and denote the feasible set by F(C).

We will occasionally denote F({u}) by F(u). The Marking Assumption discussed in Section I states that any coalition of users is only capable of creating an object whose fingerprint lies in the feasible set of the coalition.

Example: If two users u and v are assigned the codewords w(u) = 10 and w(v) = 11, then only the first position is undetectable for C = {u, v}, and their feasible set is F(C) = {10, 11, 1?}.
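The two definitions above translate directly into code. The following is a minimal Python sketch (the function names are ours, not the paper's), representing codewords as strings over {0, 1} with "?" standing for an unreadable mark:

```python
def undetectable_positions(coalition):
    """Definition II.2: positions where all coalition codewords agree."""
    return {i for i in range(len(coalition[0]))
            if len({w[i] for w in coalition}) == 1}

def in_feasible_set(x, coalition):
    """Definition II.3: x lies in F(C) iff it matches the coalition on
    every undetectable position (all members agree there, so comparing
    against the first member suffices); other positions are unconstrained."""
    R = undetectable_positions(coalition)
    return all(x[i] == coalition[0][i] for i in R)
```

For the coalition {10, 11} this reports the first position as undetectable, accepts 1? as feasible, and rejects 01.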

III. PROTECTION AGAINST NAIVE REDISTRIBUTION

To familiarize the reader with our notation we begin by considering a toy problem which we refer to as naive redistribution. Naive redistribution occurs when a user redistributes his copy of the object without altering it. If an unauthorized copy of the object is found containing user u's codeword we would like to say that user u is guilty. However, u could claim that he was framed by a coalition who created an object containing his codeword. Thus we would like to construct codes that satisfy the following property: no coalition can collude to frame a user not in the coalition. We usually relax this condition by limiting the size of the coalition to c users. We call such codes c-frameproof codes.

If the code used to fingerprint the object is kept hidden from the users, then the construction of frameproof codes becomes trivial: to every one of the users assign a unique codeword chosen at random. A coalition of users cannot frame a user not in the coalition since they do not know his codeword. We would like to construct codes that are c-frameproof even if the codebook is known to the users. This requirement can be formally stated as follows.

Definition III.1: A code Γ is c-frameproof if every set W ⊆ Γ, of size at most c, satisfies F(W) ∩ Γ = W.

The definition states that in a c-frameproof code, the only codewords in the feasible set of a coalition of at most c users are the codewords of members of the coalition. Thus no coalition of at most c users can frame a user who is not a member of the coalition. It is interesting to note that for random codes the length of the code must be exponential in c. Otherwise, a coalition of c users is likely to detect all the bits.

A. Construction of c-Frameproof Codes

We now show a construction for c-frameproof codes over the binary alphabet Σ = {0, 1}. First we introduce a simple (n, n)-code which is n-frameproof. Define the code Γ0(n) to be the (n, n)-code containing all n-bit binary words with exactly one 1. For example, the code for three users is Γ0(3) = {100, 010, 001}. The following claim is immediate.
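As a quick sanity check, the code just defined is easy to enumerate; this Python sketch (helper name ours) generates it:

```python
def gamma0(n):
    """The (n, n)-code Gamma_0(n): all n-bit binary words with exactly
    one 1. User j (0-indexed here) gets the word with a 1 in position j."""
    return ["".join("1" if i == j else "0" for i in range(n))
            for j in range(n)]
```

For n = 3 this yields exactly the three-user code given in the text.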


Claim III.1: Γ0(n) is an (n, n)-code which is n-frameproof.

It is not difficult to see that any n-frameproof code for n users must have length at least n. This follows since any coalition of n − 1 users must not be able to detect at least one of the bit positions; otherwise, they could frame a user not in the coalition. Since there are n coalitions of size n − 1, the code must have length at least n. Hence, the code length of Γ0(n) is optimal. The length of Γ0(n) is linear in the number of users and is therefore impractical. We will use the code Γ0(n) to construct shorter codes. We first recall some basic definitions from the theory of error-correcting codes; see [21] for more details.

Definition III.2: A set C = {w^(1), …, w^(N)} of words of length L over an alphabet of q letters is said to be an (L, N, D)_q-Error-Correcting Code, or in short, an (L, N, D)_q-ECC, if the Hamming distance between every pair of words in C is at least D.

The idea of this construction is to compose the code Γ0(n) with an error-correcting code. Let Γ be an (l, n)-code and let C be an (L, N, D)_n-ECC. We denote the composition of Γ and C by Γ' = Γ ∘ C. The code Γ' is an (lL, N)-code defined as follows: for a codeword w = w_1 w_2 ⋯ w_L ∈ C let

v_w = w(w_1) ‖ w(w_2) ‖ ⋯ ‖ w(w_L)

where ‖ means concatenation of strings and w(j) denotes the jth codeword of Γ. The code Γ' is the set of all words v_w, i.e., Γ' = { v_w : w ∈ C }.
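The composition can be sketched in a few lines of Python (names ours): each letter of an outer codeword indexes an inner codeword, and the chosen inner codewords are concatenated.

```python
def compose(inner_code, outer_word):
    """Form v_w for one outer codeword w: replace each letter of w by the
    corresponding inner codeword and concatenate the results."""
    return "".join(inner_code[letter] for letter in outer_word)

# Inner code: the 3-user code with exactly one 1 per word (Section III-A).
inner = ["100", "010", "001"]
```

For example, the outer word (1, 0, 2) maps to "010" + "100" + "001".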

Lemma III.2: Let Γ be a c-frameproof (l, n)-code and C be an (L, N, D)_n-ECC. Let Γ' be the composition of Γ and C. Then Γ' is a c-frameproof (lL, N)-code, provided D > L(1 − 1/c).

Proof: Let W be a coalition of at most c users of Γ'. We show that F(W) contains no words of Γ' other than those of W.

Let w^(1), …, w^(c) be the codewords of C from which the codewords of the coalition were derived.

Assume towards a contradiction that F(W) contains a word v ∈ Γ' which belongs to a user not in W. Let w ∈ C be the codeword from which v was derived. For all i, the words w and w^(i) match in less than L/c positions. This follows since the minimal Hamming distance of C is bigger than L(1 − 1/c). Hence, there must exist a position j for which w_j ≠ w^(i)_j for all i = 1, …, c.

Consider the jth block of l positions. On this block the coalition's words are the codewords of W_j = {w(w^(1)_j), …, w(w^(c)_j)} ⊆ Γ, while the restriction of v to this block is the codeword w(w_j), which is not in W_j. Since Γ is a c-frameproof code we know that w(w_j) is not in F(W_j). Since w(w_j) is a subword of v, this implies that v ∉ F(W). This contradiction proves the lemma.

We note that the condition that C has a large minimal distance can be relaxed. To make the proof work it suffices to require that no set of c words of C "cover" a word of C outside the set. This property has been studied in [11]. Using this relaxed requirement does not improve the constructions.

The error-correcting codes we are using have large minimal distance and hence, low rate. By picking the codewords randomly it is possible to obtain a good low-rate code. We state this in the following lemma, which is immediate from the Chernoff bound [2, Appendix A].

Lemma III.3: For any positive integers N and c there exists an (L, N, D)_{2c}-ECC with D > L(1 − 1/c) and L = O(c log N).
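The existence argument behind Lemma III.3 can be illustrated by sampling: draw N random words and check that every pair agrees in fewer than L/c positions, i.e., that the minimum distance exceeds L(1 − 1/c). The Python sketch below (names and the retry loop are ours) illustrates existence rather than an efficient construction:

```python
import random

def random_ecc(N, L, q, c, seed=0):
    """Sample N length-L codewords over a q-letter alphabet until every
    pair agrees in fewer than L/c positions (distance > L(1 - 1/c))."""
    rng = random.Random(seed)
    while True:
        code = [[rng.randrange(q) for _ in range(L)] for _ in range(N)]
        if all(sum(a == b for a, b in zip(u, v)) < L / c
               for i, u in enumerate(code) for v in code[i + 1:]):
            return code
```

With q = 2c a random pair agrees in L/(2c) positions on average, half the allowed budget, so by the Chernoff bound the check succeeds with high probability once L is on the order of c log N.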

The main theorem of this section now follows.

Theorem III.4: For any integers n and c there exists an (l, n)-code which is c-frameproof with l = O(c² log n).

Proof: By Lemma III.3 we know that there exists an error-correcting code with parameters (L, n, D)_{2c} where D > L(1 − 1/c) and L = O(c log n). Combining this with the code Γ0(2c) and Lemma III.2 we get a c-frameproof code for n users whose length is 2cL = O(c² log n).

To make this construction explicit we must use an explicit low-rate error-correcting code. Explicit constructions of such codes are described in [1]. The explicit constructions are not as good as the bounds provided by Lemma III.3; using a simple explicit low-rate code it is possible to obtain explicit c-frameproof codes of somewhat greater length.

IV. c-SECURE CODES

We now turn our attention to the full problem of collusion-secure fingerprinting. Suppose a distributor marks an object with a code Γ. Now, suppose a coalition of users, C, colludes to generate an unregistered object marked by some word x and then distributes this new object. When this object is found, the distributor would like to detect a subset of the coalition who created it. In other words, there must exist a tracing algorithm which on input x outputs a member of the coalition. For our purposes the tracing algorithm may be regarded as a function A : (Σ ∪ {?})^l → {1, …, n}, where n is the number of users. This leads to the following definition.

Definition IV.1: A code Γ is totally c-secure if there exists a tracing algorithm A satisfying the following condition: if a coalition C of at most c users generates a word x then A(x) ∈ C.

The tracing algorithm on input x must output a member of the coalition that generated the word. Hence, an illegal copy can be traced back to at least one member of the guilty coalition. Clearly there is no hope in recovering the entire coalition since some of its members might be passive; they are part of the coalition, but they contribute nothing to the construction of the illegal copy.

We now derive a necessary condition for a code to be totally c-secure. Consider the following scenario: let Γ be some code. Let C_1 and C_2 be two coalitions of c users each such that C_1 ∩ C_2 = ∅. Suppose an unregistered object is found which is marked by a word x which is feasible for both C_1 and C_2. Then both coalitions are suspect. Since their intersection is empty, it is not possible to determine with certainty who created the unregistered object. It follows that if Γ is totally c-secure then when the intersection of C_1 and C_2 is empty, the intersection of their feasible sets F(C_1) and F(C_2) must also be empty. In general we obtain the following lemma.


Lemma IV.1: If Γ is a totally c-secure code then F(C_1) ∩ ⋯ ∩ F(C_t) = ∅ for all coalitions C_1, …, C_t of at most c users each satisfying C_1 ∩ ⋯ ∩ C_t = ∅.

It seems that totally secure codes provide a good solution to the problem of collusion. Unfortunately, when c ≥ 2, totally c-secure codes do not exist.

Theorem IV.2: For c ≥ 2 and n ≥ 3 there are no totally c-secure (l, n)-codes.

Proof: Clearly, it is enough to show that there are no totally 2-secure codes. Let Γ be an arbitrary (l, n)-code. Let x, y, z be three distinct codewords assigned to users u, v, w respectively. Define the majority word MAJ(x, y, z) by

MAJ(x, y, z)_i =  x_i   if x_i = y_i or x_i = z_i
                  y_i   if y_i = z_i
                  ?     otherwise.

One can readily verify that the majority word is feasible for all three coalitions {u, v}, {u, w}, and {v, w}. However, the intersection of the coalitions is empty. Therefore, by Lemma IV.1, the code is not totally 2-secure.

The proof of the theorem shows that if a coalition employs the "majority" strategy it is guaranteed to defeat all fingerprinting codes. Based on this result it seems that all is lost: fingerprinting is not possible in the presence of collusion. Fortunately, there is a way out of this trap: use randomness.
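The majority strategy is concrete enough to code directly. This Python sketch (function name ours) produces, from any three codewords, a word feasible for each of the three two-user coalitions:

```python
def majority_word(x, y, z):
    """Per position, output the symbol held by at least two of the three
    users; when all three symbols differ, make the mark unreadable."""
    out = []
    for a, b, c in zip(x, y, z):
        if a == b or a == c:
            out.append(a)
        elif b == c:
            out.append(b)
        else:
            out.append("?")  # no majority: unreadable state
    return "".join(out)
```

On every position where two of the codewords agree, the output matches both of them, so any of the pair coalitions could have produced it; a tracer therefore cannot accuse anyone with certainty.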

Theorem IV.2 forces us to weaken our requirements for marking schemes. We intend to allow the distributor to make some random choices when embedding the codewords in the objects. The point is that the random choices will be kept hidden from the users. This enables us to construct codes which will capture a member of the guilty coalition with high probability.

An (l, n)-fingerprinting scheme is a function Γ(i, r) which maps a user number i ∈ {1, …, n} and a string r of random bits to a codeword in Σ^l. The random string r is the set of random bits used by the distributor and kept hidden from the users. We denote such a fingerprinting scheme simply by Γ.

Suppose a coalition of at most c users creates an illegal copy of an object. Fingerprinting schemes that enable the capture of a member of the coalition with probability at least 1 − ε are called c-secure codes with ε-error. Here the probability is taken over the random choices made by the distributor and the random choices made by the coalition.

Definition IV.2: A fingerprinting scheme Γ is c-secure with ε-error if there exists a tracing algorithm A satisfying the following condition: if a coalition C of at most c users generates a word x then

Pr[ A(x) ∈ C ] > 1 − ε

where the probability is taken over the random bits r and the random choices made by the coalition.

The tracing algorithm on input x outputs a member of the coalition that generated the word with high probability.

With this definition at hand we turn to the construction of c-secure codes.

We point out that Chor, Fiat, and Naor [8] considered a similar problem in an entirely different setting. In our terms their result enables one to construct c-secure codes under the assumption that marks cannot become unreadable. Under this assumption one can even construct totally c-secure codes. Indeed, the proof of Theorem IV.2 relied on the existence of unreadable marks.

Unfortunately, in the context of fingerprinting the assumption that marks cannot become unreadable is unrealistic. As was discussed in Section II, once a coalition detects a mark, that mark can be made unreadable in various ways (recall that by unreadable we mean that when an illegal copy is discovered it is impossible to determine which state the mark is in). For this reason, the results of Chor, Fiat, and Naor by themselves are insufficient for fingerprinting. However, as we shall see, they become quite useful when composed with our results described in the next section.

V. CONSTRUCTION OF COLLUSION-SECURE CODES

The idea for the construction of c-secure codes is similar to the one used in Section III-A. We first construct an (l, n)-code which is n-secure. Thus no matter how large the coalition is, we will be able to trace an illegal copy back to a member of the coalition with high probability. The length of this code is O(n³ log(n/ε)) and hence, too large to be practical. We then show how this code can be used to construct c-secure codes for n users whose length is logarithmic in the number of users when c is small compared to n.

We begin by presenting an (l, n)-code which is n-secure with ε-error for any ε > 0. Let c_i be a column of height n in which the first i bits are 1 and the rest are 0. The code Γ0(n, d) consists of all columns c_1, …, c_{n−1}, each duplicated d times. The amount of duplication determines the error probability ε. For example, grouping each run of d duplicated columns into a block, the code Γ0(4, d) for four users is

w(1) = 1⋯1 1⋯1 1⋯1
w(2) = 0⋯0 1⋯1 1⋯1
w(3) = 0⋯0 0⋯0 1⋯1
w(4) = 0⋯0 0⋯0 0⋯0

where each block has width d.

Let w(1), …, w(n) denote the codewords of Γ0(n, d). Before the distributor embeds the codewords of Γ0(n, d) in an object he makes the following random choice: the distributor randomly picks a permutation¹ π ∈ S_l. User i's copy of the object will be fingerprinted using the word πw(i). Note that the same permutation π is used for all users. The point is that π will be kept hidden from the users. Keeping the permutation hidden from the users is equivalent to hiding the information of which mark in the object encodes which bit in the code. It is a bit surprising that this simple random action taken by the distributor is sufficient to overcome the barrier of Theorem IV.2 and enables us to prove the following theorem.

¹S_l is the full symmetric group of all permutations on l letters. For a word x ∈ {0, 1}^l and a permutation π ∈ S_l we denote by πx the l-bit word x_{π(1)} x_{π(2)} ⋯ x_{π(l)}.
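Putting the pieces together, here is a Python sketch of the construction (names ours): build the codewords of the duplicated-column code, pick a secret permutation, and scramble every user's word with the same permutation before embedding.

```python
import random

def gamma0_nd(n, d):
    """Codewords of Gamma_0(n, d): in the d copies of column c_i
    (i = 1..n-1), user j sees a 1 exactly when j <= i."""
    return ["".join(("1" if j <= i else "0") * d for i in range(1, n))
            for j in range(1, n + 1)]

def scramble(word, perm):
    """Apply the distributor's hidden permutation pi to a codeword."""
    return "".join(word[p] for p in perm)

code = gamma0_nd(4, 2)           # ["111111", "001111", "000011", "000000"]
perm = list(range(len(code[0])))
random.Random(7).shuffle(perm)   # the secret pi, hidden from the users
embedded = [scramble(w, perm) for w in code]
```

Because the same permutation scrambles every codeword, per-position relations between users are preserved; only the correspondence between marks and code bits is hidden.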


Theorem V.1: For n ≥ 3 and ε > 0 let d = 2n² log(2n/ε). The fingerprinting scheme Γ0(n, d) is n-secure with ε-error.

The length of this code is l = (n − 1)d = O(n³ log(n/ε)). To prove the theorem we must describe an algorithm which, given a word x generated² by some coalition C, outputs a member of C with probability at least 1 − ε. First we introduce some notation.

1) Let B_s be the set of all bit positions in which the users see columns of type c_s. That is, B_s is the set of all bit positions in which the first s users see a 1 and the rest see a 0. The number of elements in B_s is d.

2) For s = 2, …, n − 1 define R_s = B_{s−1} ∪ B_s.

3) For a binary string y, let weight(y) denote the number of 1's in y.

Before we describe the algorithm we give some intuition. Suppose user s is not a member of the coalition C which produced the word x. The hidden permutation prevents the coalition from knowing which marks represent which bits in the code Γ0(n, d). The only information the coalition has is the value of the marks it can detect. Observe that without user s a coalition sees exactly the same values for all bit positions in R_s. For instance, in the code Γ0(4, d) above, the coalition {1, 3, 4} sees the exact same bit pattern for all bit positions in R_2. Hence, for a bit position in R_s, the coalition cannot tell if it lies in B_{s−1} or in B_s. This means that whichever strategy they use to set the bits of x, the 1's in x|_{R_s} will be roughly evenly distributed between B_{s−1} and B_s with high probability. Hence, if the 1's in x|_{R_s} are not evenly distributed then, with high probability, user s is a member of the coalition that generated x.

Algorithm 1: Given x ∈ {0, 1}^l, find a subset of the coalition that produced x.

1) If weight(x|_{B_1}) > 0 then output "User 1 is guilty."
2) If weight(x|_{B_{n−1}}) < d then output "User n is guilty."
3) For all s = 2 to n − 1 do: Let k = weight(x|_{R_s}). If

weight(x|_{B_{s−1}}) < k/2 − sqrt((k/2) log(2n/ε))

then output "User s is guilty."

One issue needs some clarification: the word x found in the illegal copy may contain some unreadable marks "?." As a convention these bits are set to "1" before the word is given to Algorithm 1. As a result, the algorithm indeed receives a word in {0, 1}^l. The correctness of Algorithm 1 is proved in the next two lemmas.

Lemma V.2: Consider the code Γ0(n, d) where d = 2n² log(2n/ε). Let G be the set of users which Algorithm 1 pronounces as guilty on input x. Then with probability at least 1 − ε, the set G is a subset of the coalition C that produced x.

Proof: Suppose user 1 was pronounced guilty, i.e., 1 ∈ G. Then weight(x|_{B_1}) > 0. This implies that user 1 must be a member of C (otherwise, the bits in B_1 would be undetectable for C, which would imply that weight(x|_{B_1}) = 0). Similarly, if n ∈ G then n ∈ C.

²When we say that a coalition C generated a word x, we mean that the bits of x have already been unscrambled using π⁻¹. For example, the first bit of x is the value of the mark which encodes the first bit of the codewords.

Suppose the algorithm pronounces user s as guilty, 1 < s < n. We show that the probability that s is not a member of C is at most ε/n. This will show that the probability that there exists a user in G which is not guilty is at most ε.

Let s be an innocent user, i.e., s ∉ C. As was discussed above, this means that the coalition C cannot distinguish between the bit positions in R_s. Since the permutation π was chosen uniformly at random from the set of all permutations, the 1's in x|_{R_s} may be regarded as being randomly placed in R_s. Let k = weight(x|_{R_s}). Define X to be a random variable which counts the number of 1's in x|_{B_{s−1}} given that x|_{R_s} contains k 1's. For any integer a in the appropriate range:

contains ’s. For any integer in the appropriate range:

where is the size of . Clearly, theexpectation of is . To bound the probability that waspronounced guilty we need to bound

from above. This can be done by comparingto an appro-priate binomial random variable.

Let Y be a binomial random variable over k experiments with success probability 1/2. A routine calculation shows that for any a ≤ k/2 we have that Pr[X < a] ≤ 2 Pr[Y < a]. This means that for any λ > 0

Pr[X < k/2 − λ] ≤ 2 Pr[Y < k/2 − λ] ≤ 2 e^{−2λ²/k}

where the last inequality follows from the standard Chernoff bound [2, Appendix A]. Plugging in λ = sqrt((k/2) log(2n/ε)) leads to

Pr[ X < k/2 − sqrt((k/2) log(2n/ε)) ] ≤ 2 e^{−log(2n/ε)} = ε/n.

Thus if user s is innocent then the probability of her being pronounced guilty by Algorithm 1 is at most ε/n. Therefore, the probability that some innocent user will be pronounced guilty is at most ε. This proves the lemma.

Lemma V.3: Consider the code Γ0(n, d) where d = 2n² log(2n/ε). Let G be the set of users which Algorithm 1 pronounces as guilty on input x ∈ F(C). Then the set G is not empty.

Proof: The proof of the lemma relies on the following claim.

Claim V.4: Suppose the set G is empty. Then for all s we have weight(x|_{B_s}) < 2s² log(2n/ε).


Proof: The proof is by induction on . For ,the claim is immediate since if user is not guilty then

.Now, we assume the claim holds for and prove

it for . Define

Then the following three conditions are satisfied:

The first condition follows from the fact that. The second is the inductive hypothesis and the third

follows from the fact that user was not pronounced guilty,i.e., . We show that these three conditions imply

which will prove the claim for .

which leads to

Suppose for some constant. Substitutingfor and dividing by we get

It is not difficult to see that for this inequality to be satisfied when we must have . Hence

Proof of Lemma V.3: Suppose is empty. Since user was not pronounced guilty we know that

On the other hand, for , Claim V.4 shows that

This contradiction proves the lemma.

This completes the proof of Theorem V.1.

A. Logarithmic Length -Secure Codes

The -secure code constructed in the previous section enables us to use the techniques of [8] to construct -secure -codes of length . We thank Naor for pointing out this relation. In this section we demonstrate how to apply the simplest technique from [8] to construct a short -secure code from the -secure code of the previous section. The basic idea is to use the -secure code as the alphabet over which the techniques of [8] can be applied.

Let be an -code over an alphabet of size where the codewords are chosen independently and uniformly at random.³ The idea is to compose our -secure code with the code as we did in the proof of Lemma III.2. We call the resulting code . Thus the code contains codewords and has length . It is made up of copies of . We will refer to these copies as the components of . The point is that the codewords of the code will be kept hidden from the users. This is in addition to keeping hidden the permutations used when embedding the copies of in the object.
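The composition step can be sketched in a few lines of Python. The parameters, the toy inner code, and all function names below are illustrative stand-ins, not the paper's actual construction: each symbol of a randomly chosen outer codeword is replaced by a word of the inner code, and the results are concatenated.

```python
import random

def random_outer_code(N, L, n, seed=0):
    """N codewords of length L over the alphabet {0, ..., n-1},
    chosen independently and uniformly at random (the hidden outer code)."""
    rng = random.Random(seed)
    return [[rng.randrange(n) for _ in range(L)] for _ in range(N)]

def compose(outer_word, inner_code):
    """Replace each outer symbol by its inner codeword and concatenate,
    as in the composition used in the proof of Lemma III.2."""
    return [bit for sym in outer_word for bit in inner_code[sym]]

# Toy stand-in for an inner code over n = 4 symbols, each word of length 4.
inner = [[1] * s + [0] * (4 - s) for s in range(4)]
C = random_outer_code(N=8, L=5, n=4)
composed = [compose(w, inner) for w in C]
assert all(len(w) == 5 * 4 for w in composed)  # length = L * inner length
```

In the actual scheme both the outer codewords and the embedding permutations are kept secret from the users; only the composed binary words are embedded in the objects.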

Theorem V.5: Given integers and set , and . Then,

is a code which is -secure with -error. The code contains words and has length

To prove the theorem we show an algorithm that finds a member of the guilty coalition and then prove its correctness.

Algorithm 2: Given , find a member of the coalition that produced .

1) Apply Algorithm 1 to each of the components of . For each component arbitrarily choose one of the outputs of Algorithm 1. Set to be this chosen output. Note that is a number between and . Next, form the word .

2) Find the word which matches in the largest number of positions (ties are broken arbitrarily).

3) Let be the user whose codeword is derived from . Output “User is guilty.”
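Steps 2 and 3 amount to a nearest-codeword search under the agreement (Hamming-match) count. A minimal sketch in Python, with illustrative names:

```python
def closest_codeword(w, code):
    """Step 2 of Algorithm 2: return the index of the codeword that
    agrees with w in the most positions; ties are broken arbitrarily
    (here: the lowest index wins)."""
    return max(range(len(code)),
               key=lambda i: sum(a == b for a, b in zip(w, code[i])))

# Toy hidden code over a small alphabet; w agrees fully with codeword 0.
code = [[0, 1, 2], [2, 1, 0], [0, 0, 0]]
assert closest_codeword([0, 1, 2], code) == 0
```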

Lemma V.6: Let be a word which was produced by a coalition of at most users. Then with parameters as in Theorem V.5, Algorithm 2 will output a member of with probability at least .

Proof Sketch: Let be the set of codewords in that correspond to the users in the coalition. For every , Algorithm 1 guarantees that will match for some with probability . This follows from the choice of and the fact that in component the users in see words from . It follows that the above condition will be satisfied in every component with probability at least . We refer to this as event .

Recall that the size of is at most . Therefore, when event occurs there must exist a word which matches

³In [8] the codewords of C are regarded as random hash functions h: {1, …, L} → {1, …, n}.


in positions. However, since the words in are random and hidden from the users, any word in which is not in is expected to match in only positions. Using the Chernoff bound it can be shown that the probability that a random word will match in positions is less than . Hence, the probability that some word in will match in positions is at most . This shows that when event occurs, the algorithm will output a member of with probability at least . Combining this with the fact that event occurs with probability at least proves the lemma.

VI. A LOWER BOUND

The following theorem provides a lower bound on the length of -secure codes.

Theorem VI.1: Let be an fingerprinting scheme over a binary alphabet. Suppose is -secure with error . Then the code length is at least .

Proof: Recall that is a function mapping a user and a random string to a codeword in . As we shall see, we may regard as a fixed string, in which case every user is assigned a unique codeword. Let be a set of users. Define to be the set of bit positions where users see the value “” exactly times. More precisely:

s.t.

Suppose . We show that a coalition of users can create a word that cannot be traced. Observe that for all values of the random string there must exist a such that

This follows since

Define the word by

if for some ; otherwise.    (1)

In other words, is “” for all bit positions in blocks of weight and “” in all other bit positions. We show that all coalitions of users in can create the word

with probability at least . Let be such a coalition. To create the word the coalition must first determine the value of . Unfortunately, the coalition cannot do so deterministically (for a fixed it cannot test if a certain bit position satisfies , since it cannot distinguish between and ). Therefore, to determine the value of the coalition simply guesses it. Since , the correct value is guessed with probability more than .

Let be a detectable bit position for . Since includes all but one of the users in , it can determine that for some . However, it cannot differentiate the case from . Consequently, if the coalition sees the value “” less than times in the th position (i.e., s.t. ) it must be the case that for some . Therefore, the coalition sets such bit positions to “” (which is consistent with condition (1) defining ). Similarly, if the value “” appears at least times in the th position then for some . Consequently, the coalition may safely set this bit position to “.”

The case where the coalition sees “” exactly times is a bit harder. In this case, the coalition cannot determine if or . Hence it cannot deterministically decide if the bit should be set to “” or “.” For such , the coalition flips an unbiased binary coin and sets the th bit to the value of the coin. The probability that all such bit positions are set correctly (i.e., the resulting word is ) is exactly

Hence, the coalition succeeds in generating if two events occur simultaneously: it correctly guessed and it correctly sets the bits in locations where it sees “” times. The probability that both events happen simultaneously is at least

. To summarize, we just proved that for any value of the random string , any coalition of users in can generate the word with probability at least . It follows that the word cannot be traced back to a single user with probability (since no single user belongs to all these coalitions). This completes the proof of the theorem.

VII. DISTRIBUTION SCHEME

Up until now, we have been ignoring distribution of the uniquely fingerprinted copies. This is fair, as at worst we can send each user an entire unique copy. However, this is impractical for products such as electronic books, software, or CD-ROM's, which are mass-produced. We would like to come up with a scheme in which a user receives a bulk of data common to all users, and a small number of extra bits unique to him. We refer to the bulk of data as the public data and denote it by . We refer to the extra bits given privately to user as the private data and denote it by . For the distribution scheme to be secure, given , user should not be able to deduce any information about the fingerprints in copies given to other users.

Throughout this section, let be an -code which is used to fingerprint the objects. We denote the object to be distributed by and let be its length. Assume the object can be partitioned into pieces with exactly one mark in each piece. Let be the th piece which contains the th mark in state . For any , the pieces and are interchangeable, that is, a copy with one replaced


with the other will behave identically. Given a codeword , let

be the partition set implied by .

Theorem VII.1: It is possible to solve the distribution problem using size public data and private data for some fixed constant .

Proof: Let be the length of a single piece. Let be a secure private key cryptosystem (a symmetric cipher); see [14] for the precise definition. The key has length for some fixed constant . The fact that is a secure private key cryptosystem implies that for a random key , given , no polynomial-time predicate can extract one bit of information about with nonnegligible probability. This property is crucial for the security of our distribution scheme.

For each piece , the distributor picks a random key . The public data is

The size of is , twice the size of the original object. Let be the word associated with user . The private data given to user is the collection of keys necessary to decrypt her pieces

Using this scheme, given , any user can construct a usable copy of the distributed object. The size of the private data is .
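The construction in this proof can be sketched as follows. The cipher below is a toy XOR-keystream stand-in (not a secure cryptosystem, and not the cipher the paper assumes); piece contents, key sizes, and all names are illustrative.

```python
import hashlib
import secrets

def toy_encrypt(key, data):
    """Toy stand-in for a secure private-key cipher E: XOR the data with
    a SHA-256-derived keystream. For illustration only -- not secure."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

toy_decrypt = toy_encrypt  # XOR with the same keystream is its own inverse

def distribute(pieces, codewords):
    """pieces[i][s] is piece i with its mark in state s (binary: s in {0,1}).
    Public data: every piece in every state, each under its own random key.
    Private data for user u: only the keys selecting the states given by
    u's codeword."""
    n = len(pieces)
    keys = [[secrets.token_bytes(16) for s in range(2)] for i in range(n)]
    public = [[toy_encrypt(keys[i][s], pieces[i][s]) for s in range(2)]
              for i in range(n)]
    private = {u: [keys[i][w[i]] for i in range(n)]
               for u, w in enumerate(codewords)}
    return public, private

pieces = [[b"piece0-state0", b"piece0-state1"],
          [b"piece1-state0", b"piece1-state1"]]
public, private = distribute(pieces, [[0, 1], [1, 0]])

# User 0 (codeword [0, 1]) reconstructs exactly her fingerprinted copy.
copy0 = [toy_decrypt(private[0][i], public[i][[0, 1][i]]) for i in range(2)]
assert copy0 == [b"piece0-state0", b"piece1-state1"]
```

Note the public data is twice the object size (each piece appears in both states), while each user's private data is just one key per piece, matching the bound claimed in the theorem.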

It is possible to further reduce the size of the private data by collecting the keys into buckets. For instance, by looking at pairs of keys, each bucket will contain two keys and . Each of these buckets is encrypted using a bucket key . These encryptions become part of the public data. The appropriate bucket keys are given to each user and enable him to open the buckets needed to decrypt his pieces . This technique reduces the size of the private data to half of that of the original scheme. By encoding more bits in each bucket one can reduce the size of the private data to at the cost of making the public data be long.
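The bucketing optimization for pairs can be sketched as follows; the encryption of the bucket contents under the bucket keys is elided here, and all names and sizes are illustrative.

```python
import secrets

def pair_buckets(piece_keys):
    """For each consecutive pair of pieces (2i, 2i+1), build four buckets,
    one per state combination (a, b), each holding the two piece keys
    and sealed under a fresh bucket key. (In the scheme, the buckets are
    encrypted under their bucket keys and published; that step is elided.)"""
    buckets, bucket_keys = {}, {}
    for i in range(0, len(piece_keys), 2):
        for a in (0, 1):
            for b in (0, 1):
                bucket_keys[(i, a, b)] = secrets.token_bytes(16)
                buckets[(i, a, b)] = (piece_keys[i][a], piece_keys[i + 1][b])
    return buckets, bucket_keys

def keys_for_user(codeword, bucket_keys):
    """Private data: one bucket key per pair of pieces, selecting the
    user's two states -- half as many keys as in the basic scheme."""
    return [bucket_keys[(i, codeword[i], codeword[i + 1])]
            for i in range(0, len(codeword), 2)]

piece_keys = [[secrets.token_bytes(16) for _ in range(2)] for _ in range(4)]
buckets, bkeys = pair_buckets(piece_keys)
user_keys = keys_for_user([0, 1, 1, 0], bkeys)
assert len(user_keys) == 2  # two bucket keys instead of four piece keys
```

The trade-off is visible in the sketch: the public data grows (four sealed buckets per pair of pieces) in exchange for the smaller private data, matching the text's remark about encoding more bits per bucket.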

It is worth noting that when implementing this distribution scheme, one can use a standard private key cryptosystem such as DES. Such systems use fixed-length keys. This leads to private data of length .

VIII. CONCLUSIONS

The most significant contribution of these results is to show how to overcome collusion when fingerprinting digital data. To summarize our results, we restrict the size of coalitions to be at most where is the number of users. For the problem of naive redistribution considered in Section III, we constructed codes of length . For the general redistribution problem considered in Section IV we constructed codes of length where is the error probability. We note that our codes are binary, hence each hidden mark need only be in one of two states. Finally, we demonstrated an efficient method for shipping fingerprinted data which requires only a small constant-factor increase in the size of the data.

There are still many open problems which remain to be solved. The most relevant one is that of constructing shorter collusion-secure codes. It seems that an -secure code that is shorter than the one constructed in Section V should exist. Indeed, there is some gap between the lower bound of Section VI and our constructed code. A shorter -secure code will lead to an improvement in the general construction of -secure codes as well.

Recall that throughout the paper, we assumed that secure marks can be embedded in the fingerprinted data. A mark encodes a bit of information and is secure if it can only be detected by collusion. To emphasize the fact that we will not be dealing with the implementation of secure marks, we referred to the assumption that they exist as the “Marking Assumption.” In many domains, one can construct secure marks with the aid of problems that are believed to be hard. For instance, when fingerprinting movies, a single mark can be encoded by using one camera viewpoint versus another. The choice of one viewpoint versus another in a specific scene encodes one bit of information in the film. Given an image, the problem of transforming the image to an image taken from a different viewpoint is believed to be hard. As this method of marking can be used to fingerprint movies, we say that the Marking Assumption holds in the domain of movies.

Showing that the Marking Assumption is satisfied for software is much harder. As was stated in Section I, there is a great deal of empirical evidence to support the existence of secure marks in software. However, to the best of our knowledge, no formal results exist. Progress in this direction would be of some practical importance.

ACKNOWLEDGMENT

The authors wish to thank Richard Lipton for some discussions on this work. We are grateful to Moni Naor for many suggestions and comments on this work.

REFERENCES

[1] N. Alon, J. Bruck, J. Naor, M. Naor, and R. Roth, “Construction of asymptotically good low-rate error-correcting codes through pseudo-random graphs,” IEEE Trans. Inform. Theory, vol. 38, pp. 509–516, 1992.

[2] N. Alon and J. Spencer, The Probabilistic Method. New York: Wiley, 1992.

[3] I. Biehl and B. Meyer, “Protocols for collusion-secure asymmetric fingerprinting,” in Proc. STACS, 1997.

[4] G. Blakley, C. Meadows, and G. Purdy, “Fingerprinting long forgiving messages,” in Proc. Crypto, 1985, pp. 180–189.

[5] D. Boneh and J. Shaw, “Collusion secure fingerprinting for digital data,” in Proc. Crypto ’95, LNCS 963, pp. 452–465.

[6] J. Brassil, S. Low, N. Maxemchuk, and L. O’Gorman, “Electronic marking and identification techniques to discourage document copying,” in Proc. Infocom ’94, June 1994, pp. 1278–1287.

[7] G. Caronni, “Assuring ownership rights for digital images,” in Proc. “Reliable IT Systems” (Verlaessliche IT-Systeme) VIS ’95, H. Brueggemann and W. Gerhardt-Haeckl, Eds. Germany: Vieweg, 1995.

[8] Y. M. Chee, “Turan-type problems in group testing, coding theory and cryptography,” Ph.D. dissertation, University of Waterloo, Waterloo, Ont., Canada, 1996.

[9] B. Chor, A. Fiat, and M. Naor, “Tracing traitors,” in Proc. Crypto ’94, pp. 257–270.

[10] I. Cox, J. Kilian, T. Leighton, and T. Shamoon, “A secure robust watermark for multimedia,” in Information Hiding, LNCS 1174. New York: Springer-Verlag, 1996, pp. 185–206.

[11] DigiMarc Co. [Online]. Available WWW: http://www.digimarc.com.

[12] P. Erdos, P. Frankl, and Z. Furedi, “Families of finite sets in which no set is covered by the union of r others,” Israel J. Math., vol. 51, pp. 79–89, 1985.

[13] D. Glover, The Protection of Computer Software, 2nd ed. Cambridge, UK: Cambridge Univ. Press, 1992.

[14] S. Goldwasser, “The search for provably secure cryptosystems,” in AMS Lecture Notes Cryptology and Computational Number Theory, 1990.

[15] M. Naor, private communications.

[16] B. Pfitzmann and M. Schunter, “Asymmetric fingerprinting,” in Proc. Eurocrypt ’96, pp. 84–95.

[17] B. Pfitzmann and M. Waidner, “Asymmetric fingerprinting for large collusions,” in Proc. 4th ACM Conf. Computer and Communication Security, 1997.

[18] ——, “Anonymous fingerprinting,” in Proc. Eurocrypt ’97, pp. 88–102.

[19] B. Schneier, Applied Cryptography. New York: Wiley, 1994.

[20] D. R. Stinson and R. Wei, “Combinatorial properties and constructions of traceability schemes and frameproof codes,” SIAM J. Discr. Math., to be published.

[21] K. Tanaka, Y. Nakamura, and K. Matsui, “Embedding secret information into a dithered multi-level image,” in Proc. 1990 IEEE Military Communications Conf., Sept. 1990, pp. 216–220.

[22] J. H. van Lint, Introduction to Coding Theory. Berlin, Germany: Springer-Verlag, 1982.

[23] N. Wagner, “Fingerprinting,” in Proc. 1983 IEEE Symp. Security and Privacy, Apr. 1983, pp. 18–22.