

Forensic Similarity for Digital Images
Owen Mayer, Matthew C. Stamm

Abstract—In this paper we introduce a new digital image forensics approach called forensic similarity, which determines whether two image patches contain the same forensic trace or different forensic traces. One benefit of this approach is that prior knowledge, e.g. training samples, of a forensic trace is not required to make a forensic similarity decision on it in the future. To do this, we propose a two-part deep-learning system composed of a CNN-based feature extractor and a three-layer neural network, called the similarity network. This system maps pairs of image patches to a score indicating whether they contain the same or different forensic traces. We evaluated system accuracy of determining whether two image patches were 1) captured by the same or different camera model, 2) manipulated by the same or different editing operation, and 3) manipulated by the same or different manipulation parameter, given a particular editing operation. Experiments demonstrate applicability to a variety of forensic traces, and importantly show efficacy on “unknown” forensic traces that were not used to train the system. Experiments also show that the proposed system significantly improves upon prior art, reducing error rates by more than half. Furthermore, we demonstrated the utility of the forensic similarity approach in two practical applications: forgery detection and localization, and database consistency verification.

Index Terms—Multimedia Forensics, Deep Learning, Forgery Detection

I. INTRODUCTION

TRUSTWORTHY multimedia content is important to a number of institutions in today’s society, including news outlets, courts of law, police investigations, intelligence agencies, and social media websites. As a result, multimedia forensics approaches have been developed to expose tampered images, determine important information about the processing history of images, and identify details about the camera make, model, and device that captured them. These forensic approaches operate by detecting the visually imperceptible traces, or “fingerprints,” that are intrinsically introduced by a particular processing operation [1].

Recently, researchers have developed deep learning methods that target digital image forensic tasks with high accuracy. For example, convolutional neural network (CNN) based systems have been proposed that accurately detect traces of median filtering [2], resizing [3], [4], inpainting [5], multiple processing operations [6]–[8], processing order [9], and double JPEG compression [10], [11]. Additionally, researchers have proposed approaches to identify the source camera model of digital images [12]–[15], and identify their origin social media website [16].

O. Mayer and M.C. Stamm are with the Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA 19104 USA (e-mail: [email protected]; [email protected]). This material is based upon work supported by the National Science Foundation under Grant No. 1553610.

However, there are two main drawbacks to many of these existing approaches. First, many deep learning systems assume a closed set of forensic traces, i.e. a known and closed set of possible editing operations or camera models. That is, these methods require prior training examples from a particular forensic trace, such as the source camera model or editing operation, in order to identify it again in the future. This requirement is a significant problem for forensic analysts, since they are often presented with new or previously unseen forensic traces. Additionally, it is often not feasible to scale deep learning systems to contain the large numbers of classes that a forensic investigator may encounter. For example, systems that identify the source camera model of an image often require hundreds of scene-diverse images per camera model [13]. To scale such a system to contain hundreds or thousands of camera models requires a prohibitively large data collection undertaking.

A second drawback of these existing approaches is that many forensic investigations do not require explicit identification of a particular forensic trace. For example, when analyzing a splicing forgery, which is a composite of content from multiple images, it is often sufficient to simply detect a region of an image that was captured by a different source camera model, without explicitly identifying that source. That is, the investigator does not need to determine the exact processing applied to the image, or the true sources of the pasted content, just that an inconsistency exists within the image. In another example, when verifying the consistency of an image database, the investigator does not need to explicitly identify which camera models were used to capture the database, just whether one camera model or many camera models were used.

Recently, researchers have proposed CNN-based forensic systems for digital images that do not require a closed and known set of forensic traces. In particular, research in [17] proposed a system to output a binary decision indicating whether a particular image was captured by a camera model in a set of known camera models used for training, or from an unknown camera model that the classifier has not been trained to identify. Work in [18] showed that features learned by a CNN for source camera model identification can be iteratively clustered to identify spliced images. The authors showed that this type of system can detect spliced images even when the camera models were not used to train the system.

In this paper, we propose a new digital image forensics approach that operates on an open set of forensic traces. This approach, which we call forensic similarity, determines whether two image patches contain the same or different forensic traces. This approach is different from other forensics approaches in that it does not explicitly identify the particular forensic traces contained in an image patch, just whether they are consistent across two image patches. The benefit of this approach is that prior knowledge of a particular forensic trace is not required to make a similarity decision on it.

arXiv:1902.04684v1 [cs.CV] 13 Feb 2019


Fig. 1. Forensic similarity system overview. Two feature extractors map input patches X1 and X2 to feature vectors f(X1) and f(X2); a similarity function S(f(X1), f(X2)) is compared to a threshold η to produce the decision C(X1, X2).

To do this, we propose a two-part deep learning system. In the first part, called the feature extractor, we use a CNN to extract general low-dimensional forensic features, called deep features, from an image patch. Prior research has shown that CNNs can be trained to map image patches onto a low-dimensional feature space that encodes general and high-level forensic information about that patch [3], [14], [17]–[20]. Next, we use a three-layer neural network to map pairs of these deep features onto a similarity score, which indicates whether the two image patches contain the same forensic trace or different forensic traces.

We experimentally evaluate the efficacy of our proposed approach in several scenarios. We evaluate the ability of our proposed forensic similarity system at determining whether two image patches were 1) captured by the same or different camera model, 2) manipulated by the same or different editing operation, and 3) manipulated with the same or different manipulation parameter, given a particular editing operation. Importantly, we evaluate performance on camera models, manipulations, and manipulation parameters not used during training, demonstrating that this approach is effective in open-set scenarios.

Furthermore, we demonstrate the utility of this approach in two practical applications that a forensic analyst may encounter. In the first application, we demonstrate that our forensic similarity system detects and localizes image forgeries. Since image forgeries are often a composite of content captured by two camera models, these forgeries are exposed by detecting the image regions that are forensically dissimilar to the host image. In the second application, we show that the forensic similarity system verifies whether a database of images was captured by all the same camera model, or by different camera models. This is useful for flagging social media accounts that violate copyright protections by stealing content, often from many different sources.

This paper is an extension of our previous work in [19]. In our previous work, we proposed a proof-of-concept similarity system for comparing the source camera model of two image patches, and evaluated on a limited set of camera models. In this work, we extend [19] in several ways. First, we reframe the approach as a general system that is applicable to any measurable forensic trace, such as manipulation type or editing parameter, not just source camera model. Second, we significantly improve the system architecture and training procedure, which led to an over 50% reduction in classification error. Finally, we experimentally evaluate this approach in an expanded range of scenarios, and demonstrate utility in two practical applications.

The remaining parts of the manuscript are outlined as follows. In Sec. II, we motivate and formalize the concept of forensic similarity. In Sec. III, we detail our proposed deep-learning system implementation and training procedure. In this section, we describe how to build and train the CNN-based feature extractor and similarity network. In Sec. IV, we evaluate the effectiveness of our proposed approach in a number of forensic situations, and importantly effectiveness on unknown forensic traces. Finally, in Sec. V-A, we demonstrate the utility of this approach in two practical applications.

II. FORENSIC SIMILARITY

Prior multimedia forensics approaches for digital images have focused on identifying or classifying a particular forensic trace (e.g. source camera model, processing history) in an image or image patch. These approaches, however, suffer from two major drawbacks in that 1) training samples from a particular trace are required to identify it, and 2) not all forensic analyses require identification of a trace. For example, to expose a splicing forgery it is sufficient to identify that the forged image is simply composite content from different sources, without needing to explicitly identify those sources.

In this paper, we propose a new general approach that addresses these drawbacks. We call this approach forensic similarity. Forensic similarity is an approach that determines if two image patches have the same or different forensic trace. Unlike prior forensic approaches, it does not identify a particular trace, but still provides important forensic information to an investigator. The main benefit of this type of approach is that it can be practically implemented in open-set scenarios. That is, a forensic similarity based system does not inherently require training samples from a forensic trace in order to make a forensic similarity decision. Later, in Sec. III, we describe how this approach is implemented using a CNN-based deep learning system.

In this work, we define a forensic trace to be a signal embedded in an image that is induced by, and captures information about, a particular signal processing operation performed on it. Forensic traces are inherently unrelated to the perceptual content of the image; two images depicting different scenes may contain similar forensic traces, and two images depicting similar scenes may contain different forensic traces. Common mechanisms that induce a forensic trace in an image are: the camera model that captured the image, the social media website where the image was downloaded from, and the processing history of the image. A number of approaches have been researched to extract and identify the forensic traces related to these mechanisms [7], [14], [21].

These prior approaches, however, assume a closed set of forensic traces. They are designed to perform a mapping X → Y where X is the space of image patches and Y is the space of known forensic traces that are used to train the system, e.g. camera models or editing operations. However, when an input image patch has a forensic trace y ∉ Y, the identification system is still forced to map to the space Y, leading to an erroneous result. That is, the system will misclassify this new “unknown” trace as a “known” one in Y.

This is problematic since in practice forensic investigators are often presented with images or image patches that contain a forensic trace for which the investigator does not have training examples.


We call these unknown forensic traces. This may be a camera model that does not exist in the investigator’s database, or a previously unknown editing operation. In these scenarios, it is still important to glean forensic information about the image or image patch.

To address this, we propose a system that is capable of operating on unknown forensic traces. Instead of building a system to identify a particular forensic trace, we ask the question “do these two image patches contain the same forensic trace?” Even though the forensic similarity system may have never seen a particular forensic trace before, it is still able to distinguish whether they are the same or different across two patches. This type of question is analogous to the content-based image retrieval problem [22], and the speaker verification problem [23].

We define forensic similarity as the function

C : X × X → {0, 1}        (1)

that compares two image patches. This is done by mapping two input image patches X1, X2 ∈ X to a score indicating whether the two image patches have the same or different forensic trace. A score of 0 indicates the two image patches contain different forensic traces, and a score of 1 indicates they contain the same forensic trace. In other words,

C(X1, X2) = { 0 if X1, X2 contain different forensic traces,
              1 if X1, X2 contain the same forensic trace.        (2)

To construct this system, we propose a forensic similarity system consisting of two main conceptual parts, which are shown in the system overview in Fig. 1. The first conceptual part is called the feature extractor

f : X → R^N,        (3)

which maps an input image X to a real-valued N-dimensional feature space. This feature space encodes high-level forensic information about the image patch X. Recent research in multimedia forensics has shown that convolutional neural networks (CNNs) are powerful tools for extracting general, high-level forensic information from image patches [20]. We specify how this is done in Sec. III, where we describe our proposed implementation of the forensic similarity system.

Next, we define the second conceptual part, the similarity function

S : R^N × R^N → [0, 1],        (4)

that maps pairs of forensic feature vectors to a similarity score that takes values from 0 to 1. A low similarity score indicates that the two image patches X1 and X2 have dissimilar forensic traces, and a high similarity score indicates that the two forensic traces are highly similar.

Finally, we compare the similarity score S(f(X1), f(X2)) of two image patches X1 and X2 to a threshold η such that

C(X1, X2) = { 0 if S(f(X1), f(X2)) ≤ η,
              1 if S(f(X1), f(X2)) > η.        (5)

In other words, the proposed forensic similarity system takes two image patches X1 and X2 as input. A feature extractor maps these two input image patches to a pair of feature vectors f(X1) and f(X2), which encode high-level forensic information about the image patches. Then, a similarity function maps these two feature vectors to a similarity score, which is then compared to a threshold. A similarity score above the threshold indicates that X1 and X2 have the same forensic trace (e.g. processing history or source camera model), and a similarity score below the threshold indicates that they have different forensic traces.
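To make the pipeline concrete, the decision rule in (5) can be sketched in a few lines of Python. The feature_extractor and similarity_network callables below are hypothetical stand-ins for the trained networks described in Sec. III, not part of the original system description.

```python
def forensic_similarity_decision(x1, x2, feature_extractor,
                                 similarity_network, eta=0.5):
    """Sketch of the decision rule C(X1, X2) in (5).

    feature_extractor  : maps a patch to an N-dimensional feature vector, as in (3)
    similarity_network : maps a pair of feature vectors to a score in [0, 1], as in (4)
    """
    f1 = feature_extractor(x1)      # f(X1)
    f2 = feature_extractor(x2)      # f(X2), same shared network
    s = similarity_network(f1, f2)  # S(f(X1), f(X2))
    return 1 if s > eta else 0      # 1: same forensic trace, 0: different
```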

III. PROPOSED APPROACH

In this section, we describe our proposed deep learning system architecture and associated training procedure for forensic similarity. In our proposed forensic similarity architecture and training procedure, we build upon successful CNN-based techniques used in the multimedia forensics literature, as well as propose a number of innovations that are aimed at extracting robust forensic features from images and accurately determining forensic similarity between two image patches.

Our proposed forensic similarity system consists of two conceptual elements: 1) a CNN-based feature extractor that maps an input image onto a low-dimensional feature space that encodes high-level forensic information, and 2) a three-layer neural network, which we call the similarity network, that maps pairs of these features to a score indicating whether two image patches contain the same forensic trace. This system is trained in two successive phases. In the first phase, called Learning Phase A, we train the feature extractor. In the second phase, called Learning Phase B, we train the similarity network. Finally, in this section we describe an entropy-based method of patch selection, which we use to filter out patches that are not suitable for forensic analysis.

A. Learning Phase A - Feature Extractor

Here we describe the deep-learning architecture and training procedure of the feature extractor that maps an input image patch onto a low-dimensional feature space, which encodes forensic information about the patch. This is the mapping described by (3). Later, pairs of these feature vectors are used as input to the similarity network described below in Sec. III-B.

Developments in machine learning research have shown that CNNs are powerful tools to use as generic feature extractors. This is done by robustly training a deep convolutional neural network for a particular task, and then using the neuron activations, at a deep layer in the network, as a feature representation of an image [24]. These neuron activations are called “deep features,” and are often extracted from the last fully connected layer of a network. Research has shown that deep features extracted from a CNN trained for object recognition tasks can be used to perform scene recognition tasks [25] and remote sensing tasks [26].

In multimedia forensics research, it has been shown that deep feature based approaches are also very powerful for digital image forensics tasks. For example, work in [14] showed that deep features from a network trained on one set of camera models can be used to train a support vector machine to identify a different set of camera models. Work in [17] showed that deep features from a CNN trained on a set of camera models can be used to determine whether the camera model of the input image was used to train the system.


Fig. 2. The neural network architecture of the proposed forensic similarity system. The system is composed of a pair of CNN-based feature extractors, in a hard sharing (Siamese) configuration, which feed low-dimensional, high-level forensic feature vectors to the similarity network. The similarity network is a neural network that maps feature vectors from two image patches to a similarity score indicating whether they contain the same or different forensic traces.

Furthermore, it has been shown that deep features from a CNN trained for camera model identification transfer very well to other forensic tasks such as manipulation detection [20], suggesting that deep features related to digital forensics are general to a variety of forensics tasks.

1) Architecture: To build a forensic feature extractor, we adapt the MISLnet CNN architecture developed in [7], which has been utilized in a number of works that target different digital image forensics tasks including manipulation detection [6], [7], [27] and camera model identification [20], [27]. Briefly, this CNN consists of 5 convolutional blocks, labeled ‘conv1’ through ‘conv5’ in Fig. 2, and two fully connected layers labeled ‘fc1’ and ‘fc2’. Each convolutional block, with the exception of the first, contains a convolutional layer followed by batch normalization, activation, and finally a pooling operation. The two fully connected layers, labeled ‘fc1’ and ‘fc2,’ each consist of 200 neurons with hyperbolic tangent activation. Further details of this CNN are found in [7].

To use this CNN as a deep feature extractor, an image patch is fed forward through the (trained) CNN. Then, the activated neuron values in the last fully connected layer, ‘fc2’ in Fig. 2, are recorded. These recorded neuron values are then used as a feature vector that represents high-level forensic information about the image. The extraction of deep features from an image patch is the mapping in (3), where the feature dimension N = 200 corresponds to the number of neurons in ‘fc2.’
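For concreteness, deep-feature extraction from a trained network might be implemented as in the following tf.keras-style sketch. The model object cnn and the layer name 'fc2' are assumptions made for illustration; this is not the authors' released code.

```python
import tensorflow as tf

def make_feature_extractor(cnn):
    # Truncate a trained CNN at its penultimate fully connected layer,
    # assumed here to be named 'fc2' (200 tanh neurons). The activations
    # of this layer form the N = 200 dimensional deep-feature vector in (3).
    return tf.keras.Model(inputs=cnn.input,
                          outputs=cnn.get_layer('fc2').output)

# Usage sketch: features = make_feature_extractor(cnn).predict(patch_batch)
```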

The architecture of this CNN-based feature extractor is similar to the architecture we used in our prior work in [19]. However, in this work we alter the CNN architecture in three ways to improve the robustness of the feature extractor. First, we use full-color image patches in RGB as input to the network, instead of just the green color channel used previously. Since many important forensic features are expressed across different color channels, it is important for the network to learn these feature representations. This is done by modifying each 5×5 convolutional kernel to be of dimension 5×5×3, where the last dimension corresponds to the image’s color channels. Second, we relax the constraint imposed on the first convolutional layer in [7], which is used to encourage the network to learn prediction error residuals. While this constraint is useful for forensics tasks on single-channel image patches, it is not immediately translatable to color images. Third, we double the number of kernels in the first convolutional layer from 3 to 6, to increase the expressive power of the network.

The feature extractor architecture is depicted by each of the two identical ‘Feature Extractor’ blocks in Fig. 2. In our proposed system, we use two identical feature extractors, in ‘Siamese’ configuration [28], to map two input image patches X1 and X2 to a feature space f(X1) and f(X2). This configuration ensures that the system is symmetric, i.e. the ordering of X1 and X2 does not matter. We refer to the Siamese feature extractor blocks as using hard sharing, meaning that the exact same weights and biases are shared between the two blocks.

2) Training Methodology: In our proposed approach, we first train the feature extractor during Learning Phase A. To do this, we add an additional fully-connected layer with softmax activation to the feature extractor architecture. We provide the feature extractor with image patches and labels associated with the forensic trace of each image patch. Then, we iteratively train the network using stochastic gradient descent with a cross-entropy loss function. Training is performed for 30 epochs with an initial learning rate of 0.001, which is halved every three epochs, and a batch size of 50 image patches.
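The learning rate schedule described above is simple to express in code; a minimal sketch, assuming a Keras-style per-epoch callback, is:

```python
INITIAL_LR = 0.001  # Learning Phase A initial learning rate

def phase_a_learning_rate(epoch):
    # Halve the learning rate every three epochs, per the schedule above.
    return INITIAL_LR * (0.5 ** (epoch // 3))

# e.g. pass as tf.keras.callbacks.LearningRateScheduler(phase_a_learning_rate)
```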

During Learning Phase A we train the feature extractor network on a closed set of forensic traces referred to as “known forensic traces.” Research in [20] found that training a CNN in this way yields deep-feature representations that are general to other forensic tasks. In particular, it was shown that when a feature extractor was trained for camera model identification, it was very transferable to other forensic tasks. Because of this, during Learning Phase A we train the feature extractor on a large set of image patches with labels associated with their source camera model.

In this work, we train two versions of the feature extractor network: one feature extractor that uses 256×256 image patches as input and another that uses 128×128 image patches as input. We note that to decrease the patch size further would require substantial architecture changes due to the pooling layers. In each case, we train the network using 2,000,000 image patches from the 50 camera models in “Camera model set A” found in Table I.

The feature extractor is then updated again in Learning Phase B, as described below in Sec. III-B. This is significantly different than in our previous work in [19], where the feature extractor remains frozen after Learning Phase A. In our experimental evaluation in Sec. IV-A, we show that allowing the feature extractor network to update during Learning Phase B significantly improves system performance.

B. Learning Phase B - Similarity Network

Here, we describe our proposed neural network architecture that maps a pair of forensic feature vectors f(X1) and f(X2) to a similarity score ∈ [0, 1] as described in (4). The similarity score, when compared to a threshold, indicates whether the pair of image patches X1 and X2 have the same or different forensic traces. We call this proposed neural network the similarity network, which is depicted on the right-hand side of Fig. 2. Briefly, the network consists of 3 layers of neurons, which we view as a hierarchical mapping of two input feature vectors to successive feature spaces and ultimately an output score indicating forensic similarity.

1) Architecture: The first layer of neurons, labeled ‘fcA’ in Fig. 2, contains 2048 neurons with ReLU activation. This layer maps an input feature vector f(X) to a new, intermediate feature space f_inter(X). We use two identical ‘fcA’ layers, in Siamese (hard sharing) configuration, to map each of the input vectors f(X1) and f(X2) into f_inter(X1) and f_inter(X2).

This mapping for the kth value of the intermediate feature vector is calculated by an artificial neuron function:

f_{k,inter}(X) = φ( ∑_{i=0}^{N} w_{k,i} f_i(X) + b_k ),        (6)

which is the weighted summation, with weights w_{k,0} through w_{k,N}, of the N = 200 elements in the deep-feature vector f(X), bias term b_k, and subsequent activation by the ReLU function φ(·). The weights and bias for each element of f_inter(X) are arrived at through stochastic gradient descent optimization as described below.

Next, the second layer of neurons, labeled ‘fcB’ in Fig. 2, contains 64 neurons with ReLU activation. As input to this layer, we create a vector

f_concat(X1, X2) = [ f_inter(X1) ; f_inter(X2) ; f_inter(X1) ⊙ f_inter(X2) ],        (7)

that is the concatenation of f_inter(X1), f_inter(X2), and f_inter(X1) ⊙ f_inter(X2), where ⊙ is the element-wise product operation. This layer maps the input vector f_concat(X1, X2) to a new ‘similarity’ feature space f_sim(X1, X2) ∈ R^64 using the artificial neuron mapping described in (6). This similarity feature space encodes information about the relative forensic information between patches X1 and X2.

Finally, a single neuron with sigmoid activation maps the similarity vector f_sim(X1, X2) to a single score. We call this neuron the ‘similarity neuron,’ since it outputs a single score ∈ [0, 1], where a small value indicates X1 and X2 contain different forensic traces, and larger values indicate they contain the same forensic trace. To make a decision, we compare the similarity score to a threshold η, typically set to 0.5.
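The full forward pass of the similarity network can be summarized in a short NumPy sketch implementing (6) and (7); all weight matrices below are hypothetical stand-ins for learned parameters.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def similarity_score(f1, f2, W_A, b_A, W_B, b_B, w_sim, b_sim):
    """Sketch of the similarity network forward pass (Fig. 2).

    f1, f2       : deep-feature vectors f(X1), f(X2), shape (200,)
    W_A, b_A     : shared 'fcA' parameters, shapes (2048, 200) and (2048,)
    W_B, b_B     : 'fcB' parameters, shapes (64, 6144) and (64,)
    w_sim, b_sim : similarity-neuron parameters, shape (64,) and scalar
    """
    # 'fcA' in Siamese (hard sharing) configuration: identical weights for both inputs
    f1_inter = relu(W_A @ f1 + b_A)   # eq. (6)
    f2_inter = relu(W_A @ f2 + b_A)
    # Concatenation vector of (7), including the element-wise product
    f_concat = np.concatenate([f1_inter, f2_inter, f1_inter * f2_inter])
    f_sim = relu(W_B @ f_concat + b_B)  # 'fcB' similarity feature space
    return 1.0 / (1.0 + np.exp(-(w_sim @ f_sim + b_sim)))  # sigmoid similarity neuron
```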

The proposed similarity network architecture differs from our prior work in [19] in that we increase the number of neurons in ‘fcA’ from 1024 to 2048, and we add to the concatenation vector the element-wise multiplication of f_inter(X1) and f_inter(X2). Work in [29] showed that the element-wise product of feature vectors is powerful for speaker verification tasks in machine learning systems. These additions increase the expressive power of the similarity network, and as a result improve system performance.

2) Training Methodology: Here, we describe the second step of the forensic similarity system training procedure, called Learning Phase B. In this learning phase, we train the similarity network to learn a forensic similarity mapping for any type of measurable forensic trace, such as whether two image patches were captured by the same or different camera model, or manipulated by the same or different editing operation. We control which forensic traces are targeted by the system with the choice of training samples and labels provided during training.

Notably, during Learning Phase B, we allow the error to back propagate through the feature extractor and update the feature extractor weights. This allows the feature extractor to learn better feature representations associated with the type of forensic trace targeted in this learning phase. Allowing the feature extractor to update during Learning Phase B significantly differs from the implementation in [19], which used a frozen feature extractor.

We train the similarity network (and update the feature extractor simultaneously) using stochastic gradient descent for 30 epochs, with an initial learning rate of 0.005 which is halved every three epochs. The descriptions of training samples and associated labels used in Learning Phase B are described in Sec. IV, where we investigate efficacy on different types of forensic traces.
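In a Keras-style implementation, the key Phase B detail is simply that the feature extractor is left trainable while the joint model is optimized. A sketch under that assumption follows; the model objects and the binary cross-entropy loss are illustrative choices, not stated in the text.

```python
import tensorflow as tf

def compile_phase_b(siamese_model, feature_extractor):
    # Hypothetical model objects: `feature_extractor` is the Phase A CNN,
    # `siamese_model` wires two hard-shared copies of it into the
    # similarity network.
    feature_extractor.trainable = True  # unfrozen: updated jointly in Phase B
    siamese_model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.005),  # halved every 3 epochs
        loss='binary_crossentropy')  # illustrative loss for same/different labels
    return siamese_model
```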

C. Patch Selection

Some image patches may not contain sufficient information to be reliably analyzed for forensics purposes [31]. Here, we describe a method for selecting image patches that are appropriate for forensic analysis. In this paper we use an entropy-based selection method to filter out pairs of image patches prior to analyzing their forensic similarity. This filter is employed during evaluation only and not while training.

To do this, we view a forensic trace as an amount of information encoded in an image that has been induced by some processing operation. An image patch is a channel that contains this information. From this channel we extract forensic information, via the feature extractor, and then compare pairs of these features using the similarity network. Consequently, an image patch must have sufficient capacity in order to encode forensic information.

When evaluating pairs of image patches, we ensure that both patches have sufficient capacity to encode a forensic trace by measuring their entropy.

TABLE I
CAMERA MODELS USED IN TRAINING (SETS A AND B) AND TESTING (SET C). NOTE THAT A ∩ B = A ∩ C = B ∩ C = ∅.
* DENOTES FROM THE DRESDEN IMAGE DATABASE [30]

Camera model set A (50 models): Apple iPhone 4, Canon PC1730, Huawei Honor 5x, Nikon Coolpix S7000, Praktica DCZ5.9*, Sony DSC-W800, Apple iPhone 4s, Canon Powershot A580, LG G2, Nikon Coolpix S710*, Ricoh GX100*, Sony DSC-WX350, Apple iPhone 5, Canon Powershot ELPH 160, LG G3, Nikon D200*, Rollei RCP-7325XS*, Sony DSC-H50*, Apple iPhone 5s, Canon Powershot S100, LG Nexus 5x, Nikon D3200, Samsung Galaxy Note4, Sony DSC-T77*, Apple iPhone 6, Canon Powershot SX530 HS, Motorola Droid Maxx, Nikon D7100, Samsung Galaxy S2, Sony NEX-5TL, Apple iPhone 6+, Canon Powershot SX420 IS, Motorola Droid Turbo, Panasonic DMC-FZ50*, Samsung Galaxy S4, Apple iPhone 6s, Canon SX610 HS, Motorola X, Panasonic FZ200, Samsung L74wide*, Agfa Sensor530s*, Casio EX-Z150*, Motorola XT1060, Pentax K-7, Samsung NV15*, Canon EOS SL1, Fujifilm FinePix S8600, Nikon Coolpix S33, Pentax OptioA40*, Sony DSC-H300

Camera model set B (30 models): Apple iPad Air 2, Canon Ixus70*, Fujifilm FinePix XP80, LG Nexus 5, Olympus Stylus TG-860, Samsung Galaxy S3, Apple iPhone 5c, Canon PC1234, Fujifilm FinePix J50*, Motorola Nexus 6, Panasonic TS30, Samsung Galaxy S5, Agfa DC-733s*, Canon Powershot G10, HTC One M7, Nikon D70*, Pentax OptioW60*, Samsung Galaxy S7, Agfa DC-830i*, Canon Powershot SX400 IS, Kodak EasyShare C813, Nikon D7000, Samsung Galaxy Note3, Sony A6000, Blackberry Leap, Canon T4i, Kodak M1063*, Nokia Lumia 920, Samsung Galaxy Note5, Sony DSC-W170*

Camera model set C (15 models): Agfa DC-504*, Canon Powershot A640*, LG Realm, Olympus mju-1050SW*, Samsung Galaxy Note2, Agfa Sensor505x*, Canon Rebel T3i, Nikon Coolpix S3700, Samsung Galaxy Lite, Samsung Galaxy S6 Edge, Canon Ixus55*, LG Optimus L90, Nikon D3000, Samsung Galaxy Nexus, Sony DSC-T70

Here, entropy h is defined as

h = − ∑_{k=0}^{255} p_k ln(p_k),        (8)

where p_k is the probability that a pixel has luminance value k in the image patch. Entropy h is measured in nats. We estimate p_k by measuring the proportion of pixels in an image patch that have luminance value k.

When evaluating image patches, we ensure that both image patches have entropy between 1.8 and 5.2 nats. We chose these values since 95% of image patches in our database fall within this range. Intuitively, the minimum threshold for our patch selection method eliminates flat (e.g. saturated) image patches, which would appear the same regardless of camera model or processing history. This method also removes patches with very high entropy. In this case, there is high pixel value variation in the image that may obfuscate the forensic trace.
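A direct NumPy implementation of this selection rule, following (8) and the 1.8 to 5.2 nat range, might look like:

```python
import numpy as np

def patch_entropy(luminance):
    """Entropy h of (8), in nats, for an 8-bit luminance patch."""
    counts = np.bincount(luminance.ravel(), minlength=256)
    p = counts / counts.sum()  # estimate p_k as the proportion of pixels with value k
    p = p[p > 0]               # treat 0 ln 0 as 0
    return float(-np.sum(p * np.log(p)))

def is_suitable_patch(luminance, lo=1.8, hi=5.2):
    # Keep only patches whose entropy falls within the range used in the paper.
    return lo <= patch_entropy(luminance) <= hi
```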

IV. EXPERIMENTAL EVALUATION

We conducted a series of experiments to test the efficacy of our proposed forensic similarity system in different scenarios. In these experiments, we tested our system’s accuracy in determining whether two image patches were 1) captured by the same or different camera model, 2) manipulated by the same or different editing operation, and 3) manipulated by the same or different manipulation parameter, given a particular editing operation. These scenarios were chosen for their variety in types of forensic traces and because those traces are targeted in forensic investigations [7], [14], [15]. Additionally, we conducted experiments that examined properties of the forensic similarity system, including: the effects of patch size and post-compression, comparison to other similarity measures, and the impact of network design and training procedure choices.

The results of these experiments show that our proposed forensic similarity system is highly accurate for comparing a variety of types of forensic traces across two image patches. Importantly, these experiments show this system is accurate even on “unknown” forensic traces that were not used to train the system. Furthermore, the experiments show that our proposed system significantly improves upon prior art in [19], reducing error rates by over 50%.

To do this, we started with a database of 47,785 images collected from 95 different camera models, which are listed in Table I. Images from 26 camera models were collected as part of the Dresden Image Database “Natural images” dataset [30]. The remaining 69 camera models were from our own database composed of point-and-shoot, cellphone, and DSLR cameras, from which we collected at minimum 300 images with diverse and varied scene content. The camera models were split into three disjoint sets, A, B, and C. Images from A were used to train the feature extractor in Learning Phase A, images from A and B were used to train the similarity network in Learning Phase B, and images from C were used for evaluation only. First, set A was selected by randomly choosing 50 camera models from among those for which at least 40,000 non-overlapping 256×256 patches were available. Next, camera model set B was selected by randomly choosing 30 camera models, from among the remaining, which contained at least 25,000 total non-overlapping 256×256 patches. Finally, the remaining 15 camera models were assigned to C.

In all experiments, we started with a pre-trained feature extractor that was trained from 2,000,000 randomly chosen image patches from camera models in A (40,000 patches per model) with labels corresponding to their camera model, as described in Sec. III. For all experiments, we started with this feature extractor since research in [20] showed that deep features related to camera model identification are a good starting point for extracting other types of forensic information.

Next, in each experiment we conducted Learning Phase B to target a specific type of forensic trace. To do this, we created a training dataset of pairs of image patches. These pairs were selected by randomly choosing 400,000 image patches of size 256×256 from images in camera model sets A and B, with 50% of patch pairs chosen from the same camera model, and 50% from different camera models.

Fig. 3. Camera model correct comparison rates for 25 different camera models. Same camera model correct classification rates are on the diagonal, and different camera model correct classification rates are in the non-diagonal entries. Ten camera models were used in training, i.e. “Known” camera models, and 15 were not used in training, i.e. are “Unknown” with respect to the classifier. Color scales with classification rate.

For experiments where the source camera model was compared, a label of 0 or 1 was assigned to each pair corresponding to whether they were captured by different or the same camera model. For experiments where we compared the manipulation type or manipulation parameter, these image patches were then further manipulated (as described in each experiment below) and a label was assigned indicating the same or different manipulation type/parameter. Training was performed using TensorFlow v1.10.0 with an Nvidia GTX 1080 Ti.¹
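The pair-construction step can be sketched as follows; patches_by_model is a hypothetical mapping from camera model to its available patches, standing in for the actual dataset plumbing.

```python
import random

def make_training_pairs(patches_by_model, n_pairs):
    """Sketch: build pairs with 50% same-source (label 1), 50% different (label 0)."""
    models = list(patches_by_model)
    pairs, labels = [], []
    for i in range(n_pairs):
        if i % 2 == 0:  # same camera model
            m = random.choice(models)
            pairs.append(tuple(random.sample(patches_by_model[m], 2)))
            labels.append(1)
        else:           # different camera models
            m1, m2 = random.sample(models, 2)
            pairs.append((random.choice(patches_by_model[m1]),
                          random.choice(patches_by_model[m2])))
            labels.append(0)
    return pairs, labels
```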

To evaluate system performance, we created an evaluation dataset of 1,200,000 pairs of image patches, which were selected by randomly choosing 256×256 image patches from the 15 camera models in set C (“unknown” camera models not used in training). We also included image patches from 10 camera models randomly chosen from set A. One device from each of these 10 “known” camera models was withheld from training, and only images from these devices were used in this evaluation dataset. For experiments where we compared the manipulation type or manipulation parameter, the pairs of image patches in the evaluation dataset were then further manipulated (as described in each experiment below) and assigned a label indicating the same or different manipulation type/parameter.

A. Source Camera Model Comparison

In this experiment, we tested the efficacy of our proposed forensic similarity approach for determining whether two image patches were captured by the same or different camera model. To do this, during Learning Phase B we trained the similarity network using an expanded training dataset of 1,000,000 pairs of 256×256 image patches selected from camera models in A and B, with labels corresponding to whether the source camera model was the same or different. Evaluation was then performed on the evaluation dataset of 1,200,000 pairs chosen from camera models in A (known) and C (unknown).

¹ Pre-trained models and example code for each experiment are available from the project repository at gitlab.com/MISLgit/forensic-similarity-for-digital-images and our laboratory website misl.ece.drexel.edu/downloads/.

Fig. 3 shows the accuracy of our proposed forensic similarity system, broken down by camera model pairing. The diagonal entries of the matrix show the correct classification rates when two image patches were captured by the same camera model. The non-diagonal entries of the matrix show the correct classification rates when two image patches were captured by different camera models. For example, when both image patches were captured by a Canon Rebel T3i, our system correctly identified their source camera model as “the same” 98% of the time. When one image patch was captured by a Canon PowerShot A640 and the other image patch was captured by a Nikon CoolPix S710, our system correctly identified that they were captured by different camera models 99% of the time.

The overall classification accuracy for all cases was 94.00%. The upper-left region shows classification accuracy for when two image patches were captured by known camera models, Casio EX-Z150 through iPhone 6s. The total accuracy for the known versus known cases was 95.93%. The upper-right region shows classification accuracy for when one patch was captured by an unknown camera model, Agfa DC-504 through Sony Cybershot DSC-T70, and the other patch was captured by a known camera model. The total accuracy for the known versus unknown cases was 93.72%. The lower-right region shows classification accuracy for when both image patches were captured by unknown camera models. For the unknown versus unknown cases, the total accuracy was 92.41%. This result shows that while the proposed forensic similarity system performs better on known camera models, the system is accurate on image patches captured by unknown camera models.

TABLE II
ACCURACY OF CAMERA MODEL COMPARISON ON DIFFERENT PATCH TYPES. PATCH PAIRS FROM 10 KNOWN (K), 15 UNKNOWN (U) CAMERA MODELS, WITH OR WITHOUT SECOND JPEG COMPRESSION QF=95.

Patch size   JPEG    K,K      K,U      U,U      Total
256×256      None    95.19%   93.27%   92.22%   93.61%
128×128      None    94.54%   91.22%   90.07%   92.02%
256×256      QF=95   94.00%   91.22%   90.15%   91.83%
128×128      QF=95   90.55%   87.64%   87.37%   88.63%

In the majority of camera model pairs, our proposed forensic similarity system is highly accurate, achieving >95% accuracy in 257 of the 325 unique pairings of all camera models, and 95 of the 120 possible pairs of unknown camera models. There are also certain pairs where the system does not achieve high comparison accuracy. Many of these cases occurred when two image patches were captured by similar camera models of the same manufacturer. As an example, when one camera model was an iPhone 6 and the other an iPhone 6s, the system only achieved a 26% correct classification rate. This was likely due to the similarity in hardware and processing pipeline of both of these cellphones, leading to very similar forensic traces. This phenomenon was also observed in the cases of Canon Powershot A640 versus Canon Ixus 55, any combination of LG phones, Samsung Galaxy S6 Edge versus Samsung Galaxy Lite, and Nikon Coolpix S3700 versus Nikon D3000.

The results of this experiment show that our proposed forensic similarity system is effective at determining whether two image patches were captured by the same or different camera model, even when the camera models were unknown, i.e. not used to train the system. This experiment also shows that, while the system achieves high accuracy in most cases, there are certain pairs of camera models where the system does not achieve high accuracy, and this is often due to the underlying similarity of the camera model systems themselves.

1) Patch Size and Re-Compression Effects: A forensic investigator may encounter smaller image patches and/or images that have undergone additional compression. In this experiment, we examined the performance of our proposed system when presented with input images that have undergone a second JPEG compression and when the patch size is reduced to 128×128.

To do this, we repeated the above source camera model comparison experiment in four scenarios: input patches of size 256×256, input patches of size 128×128, JPEG re-compressed patches of size 256×256, and finally JPEG re-compressed patches of size 128×128. We first created copies of the training dataset and evaluation dataset in which each image was JPEG compressed with quality factor 95 before patches were extracted. We then trained the similarity network (Learning Phase B) in each of the four scenarios. For experiments with 128×128 patches, we used the same 256×256 patches but cropped so only the top-left corner remained.
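The two dataset variants are straightforward to generate. A sketch with Pillow follows; note it folds both steps into one illustrative helper, whereas the paper re-compresses whole images before patch extraction and separately crops 256×256 patches to their top-left 128×128.

```python
from PIL import Image

def make_variant(src_path, dst_path, quality=95, crop_size=None):
    # Apply a second JPEG compression at quality factor 95, and optionally
    # keep only the top-left crop (e.g. crop_size=128 for 128x128 patches).
    img = Image.open(src_path)
    if crop_size is not None:
        img = img.crop((0, 0, crop_size, crop_size))  # top-left corner only
    img.convert('RGB').save(dst_path, 'JPEG', quality=quality)
```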

Source camera model comparison accuracy for each scenario is shown in Table II. The column K,K indicates when both patches were from known camera models, K,U when one patch was from a known camera model and one from an unknown camera model, and finally U,U indicates when both were from unknown camera models.

TABLE III
COMPARISON TO OTHER SIMILARITY MEASURES

Distance Measure    Accuracy    Learned Measure   Accuracy
1-Norm              93.06%      MS‘18 [19]        85.70%
2-Norm              93.28%      ER Trees          92.44%
Inf. Norm           91.73%      SVM               92.84%
Bray-Curtis [18]    92.57%      Proposed          93.61%
Cosine              92.87%

Generally, classification accuracy decreased when using the smaller 128×128 patches and when secondary JPEG compression was introduced. For example, total classification rates decreased from 93.61% for 256×256 patches to 92.64% for 128×128 patches without compression, and from 91.83% for 256×256 patches to 88.63% for 128×128 patches with compression.

The results of this experiment show that operating at a finer resolution and/or introducing secondary JPEG compression negatively impacts source camera model comparison accuracy. However, the proposed system is still able to operate at a relatively high accuracy in these scenarios.

2) Other Approaches: In this experiment, we compared the accuracy of our proposed approach to other approaches including distance metrics, support vector machines (SVM), extremely randomized trees (ER Trees), and prior art in [19].

For the machine learning approaches, we trained each method on deep features of the training dataset extracted by the feature extractor after Learning Phase A. We did this to emulate Learning Phase B, where the machine learning approach is used in place of our proposed similarity network. We compared a support vector machine (SVM) with RBF kernel, γ = 0.01, C = 1.0, and an extremely randomized trees (ER Trees) classifier with 800 estimators and minimum split depth of 3. We also compared to the method proposed in [19], and used the same training and evaluation data as with our proposed method.

For the distance measures, we extracted deep features from the evaluation set after Learning Phase B. This was done to give a fairer comparison to the machine learning systems, which have the benefit of the additional training data. We measured the distance between each pair of deep features and compared it to a threshold. The threshold for each approach was chosen to be the one that maximized total accuracy.
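This baseline evaluation amounts to sweeping a threshold over pairwise feature distances; a NumPy sketch:

```python
import numpy as np

def best_threshold_accuracy(distances, labels):
    """Sketch: accuracy of a thresholded distance measure.

    distances : pairwise deep-feature distances (smaller should mean 'same')
    labels    : 1 for same-trace pairs, 0 for different-trace pairs
    Returns the accuracy at the threshold maximizing total accuracy.
    """
    best = 0.0
    for t in np.unique(distances):
        pred = (distances <= t).astype(int)  # predict 'same' below the threshold
        best = max(best, float(np.mean(pred == labels)))
    return best

# e.g. 2-norm distances: d = np.linalg.norm(feats1 - feats2, axis=1)
```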

The total classification accuracy achieved on the evaluation set is shown in Table III, with the proposed system accuracy of 93.61% shown for reference. For the fixed distance measures, the 2-Norm distance achieved the highest accuracy of 93.28%, and the Infinite Norm distance achieved the lowest accuracy among those tested, at 91.73%. The Bray-Curtis distance, which was used in [18] to cluster image patches based on source camera model, achieved an accuracy of 92.57%.

For the learned measures, the ER Trees classifier achievedan accuracy of 92.44% and the SVM achieved an accuracy of92.84%, both lower than our proposed similarity system. Wealso compared against the system proposed in our previouswork [19], which achieved a total accuracy of 85.70%. Theresults of this experiment show that our proposed system


TABLE IV
PERFORMANCE OF DIFFERENT TRAINING METHODS

Training Data   Feature Extractor   Accuracy
B               Frozen              90.24%
B               Unfrozen            90.96%
AB              Frozen              92.56%
AB              Unfrozen            93.61%

The results of this experiment show that our proposed system outperforms other distance measures and learned similarity measures. The experiment also shows that the system proposed in this paper significantly improves upon the prior work in [19], decreasing the system error rate by over 50%.

3) Training Methods: In this experiment, we examined the effects of two design aspects of the Learning Phase B training procedure. In particular, these aspects are 1) allowing the feature extractor to update, i.e. remain unfrozen during training, and 2) using a diverse training dataset. This experiment was conducted to explicitly compare to the training procedure in [19], where the feature extractor was not updated (frozen) in Learning Phase B and only a subset of the available training camera models was used.

To do this, we created an additional training database of 400,000 image patch pairs of size 256×256, mimicking the original training dataset but containing only image patches captured by camera models in set B. This was done since the procedure in [19] specified conducting Learning Phase B on camera models that were not used in Learning Phase A. We refer to this as training set B, and to the original training set as AB. We then performed Learning Phase B using each of these datasets. Furthermore, we repeated each training scenario with the learning rate multiplier of each layer in the feature extractor set to 0, i.e. with the feature extractor frozen. This was done to compare to the procedure in [19], which used a frozen feature extractor.
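The frozen/unfrozen distinction can be sketched as follows in PyTorch (an illustrative choice; the paper does not specify its framework), using a stand-in model with the same two-part structure.

```python
# PyTorch-style sketch of the frozen vs. unfrozen comparison.
import torch

class PairModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.feature_extractor = torch.nn.Linear(256, 64)   # stand-in for the CNN
        self.similarity_network = torch.nn.Linear(128, 2)   # stand-in for the head

def set_feature_extractor_frozen(model, frozen):
    # Freezing is equivalent to a learning-rate multiplier of 0 on these layers.
    for p in model.feature_extractor.parameters():
        p.requires_grad = not frozen

model = PairModel()
set_feature_extractor_frozen(model, frozen=True)            # procedure of [19]
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3)
```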

The overall accuracy achieved in each of the four scenarios is shown in Table IV. When training on set B with a frozen feature extractor, which is the same procedure used in [19], the total accuracy on the evaluation image patches was 90.24%. When allowing the feature extractor to update, accuracy increased by 0.72 percentage points to 90.96%. When increasing training data diversity to camera model set AB, but using a frozen feature extractor, the accuracy achieved was 92.56%. Finally, when using a diverse dataset and an unfrozen feature extractor, the total accuracy achieved was 93.61%.

The results of this experiment show that our proposed training procedure is a significant improvement over the procedure used in [19], improving accuracy by 3.37 percentage points. Furthermore, we can see the added benefit of our proposed architecture enhancements by comparing to the result MS'18 in Table III, which uses both the training procedure and system architecture of [19]. Improving the system architecture alone raised classification rates from 85.70% to 90.24%. Improving the training procedure further raised classification rates to 93.61%, together reducing the error rate by more than half.

B. Editing Operation Comparison

A forensic investigator is often interested in determining whether two image patches have the same processing history.

TABLE V
KNOWN MANIPULATIONS USED IN TRAINING AND UNKNOWN MANIPULATIONS USED IN EVALUATION, WITH ASSOCIATED PARAMETERS

Manipulation                  Parameter        Value Range

Known Manipulations
Unaltered                     −                −
Resizing (bilinear)           Scaling factor   [0.6, 0.9] ∪ [1.1, 1.9]
Gaussian blur (5×5)           σ                [1.0, 2.0]
Median blur                   Kernel size      {3, 5, 7}
AWG Noise                     σ                [1.5, 2.5]
JPEG Compression              Quality factor   {50, 51, 52, . . . , 95}
Unsharp mask (r = 2, t = 3)   Percent          [50, 200]
Adaptive Hist. Eq.            −                −

Unknown Manipulations
Wiener filter                 Kernel size      {3, 5, 7}
Web dithering                 −                −
Salt + pepper noise           Percent          {5, 6, 7, . . . , 20}

In this experiment, we investigated the efficacy of our proposed approach for determining whether two image patches were manipulated by the same or different editing operation, including "unknown" editing operations not used to train the system.

To do this, we started with the training database of image patch pairs. We then modified each patch with one of the eight "known" manipulations in Table V, with a randomly chosen editing parameter. We manipulated 50% of the image patch pairs with the same editing operation, but with a different parameter, and manipulated 50% of the pairs with different editing operations. The known manipulations were the same manipulations used in [6] and [8]. We repeated this for the evaluation database, using both the "known" and "unknown" manipulations. Wiener filtering was performed using the SciPy Python library, web dithering was performed using the Python Image Library, and salt and pepper noise was added using the scikit-image toolbox (skimage). We note that the histogram equalization and JPEG compression manipulations were performed on the whole image. We then performed Learning Phase B using the manipulated training database, with labels associated with each pair corresponding to whether they had been manipulated by the same or different editing operation. Finally, we evaluated accuracy on the evaluation dataset, with patches processed in a similar manner.
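A rough sketch of the three "unknown" manipulations using the libraries named above is given below. The input file is hypothetical, parameter values follow Table V, and the exact call patterns used by the authors may differ.

```python
# Illustrative application of the three "unknown" manipulations.
import numpy as np
from PIL import Image
from scipy.signal import wiener
from skimage.util import random_noise

patch = np.asarray(Image.open("patch.png").convert("L"), dtype=float)

wiener_filtered = wiener(patch, mysize=5)        # Wiener filter, kernel in {3,5,7}
dithered = Image.fromarray(patch.astype(np.uint8)).convert(
    "P", palette=Image.WEB, dither=Image.FLOYDSTEINBERG)      # web dithering
salted = random_noise(patch / 255.0, mode="s&p", amount=0.10)  # 5-20% salt & pepper
```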

Table VI shows the correct classification rates of our proposed forensic similarity system, broken down by manipulation pairing. The first eight columns show rates for when one patch was edited with a known manipulation and the other patch was edited with an unknown manipulation. The last three columns show rates for when both patches were edited by unknown manipulations. For example, when one image patch was manipulated with salt and pepper noise and the other patch was manipulated with histogram equalization, our proposed system correctly identified that they had been edited by different manipulations at a rate of 95%. When both image patches were edited with Wiener filtering, our proposed system correctly identified that they were edited by the same manipulation at a rate of 96%. The total accuracy for the known-versus-known cases was 97.0%; these rates are not shown for the sake of brevity.



TABLE VI
CORRECT CLASSIFICATION RATES FOR COMPARING MANIPULATION TYPE OF TWO IMAGE PATCHES, WITH KNOWN AND UNKNOWN MANIPULATIONS

                          Known Manipulations                                              Unknown Manipulations
Manip. Type   Orig.   Resize   Gauss. Blur   Med. Blur   AWGN   JPEG   Sharpen   Hist. Eq.   Wiener   Web Dither   Salt Pepper

Wiener        1.00    0.94     0.03          0.87        1.00   1.00   1.00      1.00        0.96     1.00         1.00
Web Dither    0.95    1.00     1.00          1.00        0.80   1.00   0.06      0.92        1.00     1.00         0.02
Salt Pepper   0.94    1.00     1.00          1.00        0.99   1.00   0.03      0.95        1.00     0.02         1.00

[Fig. 4 shows a heatmap of correct classification rates for each pairing of Patch 1 and Patch 2 resizing factors in {0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5}; the per-pair values are omitted here.]

Fig. 4. Correct classification rates for comparing the resizing parameter in two image patches. Unknown scaling factors are {0.8, 1.2, 1.4}. Blue highlights cases where one patch has an unknown scaling factor; red highlights cases where both patches have unknown scaling factors.

There are certain pairs of manipulations for which the proposed system does not achieve high comparison accuracy. These include Wiener filtering versus Gaussian blurring, web dithering versus sharpening, salt and pepper noise versus sharpening, and web dithering versus salt and pepper noise. The first case is likely due to the smoothing similarities between Wiener filtering and Gaussian blurring. The latter cases are likely due to the similar high frequency artifacts introduced by the sharpening, web dithering, and salt and pepper manipulations. Despite these cases, our proposed system achieves high accuracy in the majority of manipulation pairs, even when one or both manipulations are unknown.

The results of this experiment demonstrate that our proposed forensic similarity system is effective at comparing the processing history of image patches, even when image patches have undergone an editing operation that was unknown, i.e. not used during training.

C. Editing Parameter Comparison

In this experiment, we investigated the efficacy of our proposed approach for determining whether two image patches have been manipulated by the same or different manipulation parameter. Specifically, we examined pairs of image patches that had been resized by the same scaling factor or that had been resized by different scaling factors, including "unknown" scaling factors that were not used during training. This type of analysis is important when analyzing spliced images where both the host image and foreign content were resized, but the foreign content was resized by a different factor.

To do this, we started with the training database of image patch pairs. We then resized each patch with one of the seven "known" resizing factors in {0.6, 0.7, 0.9, None, 1.1, 1.3, 1.5} using bilinear interpolation. We resized 50% of the image patch pairs with the same scaling factor, and resized 50% of the pairs with different scaling factors. We repeated this for the evaluation database, using both the "known" scaling factors and the "unknown" scaling factors in {0.8, 1.2, 1.4}. We then performed Learning Phase B using the training database of resized image patches, with labels corresponding to whether each pair of image patches was resized by the same or different scaling factor.
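The pair-construction logic can be sketched as follows; all names are illustrative and do not come from the paper.

```python
# Sketch of the resizing-pair construction: 50% same-factor pairs, 50%
# different-factor pairs, using bilinear interpolation.
import random
from PIL import Image

KNOWN_FACTORS = [0.6, 0.7, 0.9, None, 1.1, 1.3, 1.5]   # None = left unresized

def resize_patch(img, factor):
    if factor is None:
        return img
    w, h = img.size
    return img.resize((round(w * factor), round(h * factor)), Image.BILINEAR)

def make_resized_pair(img1, img2):
    f1 = random.choice(KNOWN_FACTORS)
    if random.random() < 0.5:
        f2, label = f1, 1                                # same scaling factor
    else:
        f2 = random.choice([f for f in KNOWN_FACTORS if f != f1])
        label = 0                                        # different scaling factors
    return resize_patch(img1, f1), resize_patch(img2, f2), label
```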

The correct classification rates of our proposed approach are shown in Fig. 4, broken down by tested resizing factor pairing. For example, when one image patch was resized by a factor of 0.8 and the other image patch was resized by a factor of 1.4, both unknown scaling factors, our proposed system correctly identified that the image patches were resized by different scaling factors at a rate of 99%. Cases where at least one patch has been resized with an unknown scaling factor are highlighted in blue. Cases where both patches have been resized with an unknown scaling factor are outlined in red.

Our system achieves greater than 90% correct classification rates in 33 of the 45 tested scaling factor pairings. There are also some cases where our proposed system does not achieve high accuracy. These cases tend to occur when presented with image patches that have been resized with different but similar resizing factors. For example, when resizing factors of 1.4 and 1.3 are used, the system correctly identifies the scaling factors as different only 12% of the time.

The results of this experiment show that our proposed approach is effective at comparing the manipulation parameter in two image patches, a third type of forensic trace. This experiment shows that our proposed approach is effective even when one or both image patches have been manipulated with a parameter of the editing operation that was not used in training.

V. PRACTICAL APPLICATIONS

The forensic similarity approach is a versatile technique that is useful in many different practical applications. In this section, we demonstrate how forensic similarity is used in two types of forensic investigations: image forgery detection and localization, and image database consistency verification.

A. Forgery Detection and Localization

Here we demonstrate the utility of our proposed forensic similarity system in the important forensic analysis of forged images. In forged images, an image is altered to change its perceived meaning. This can be done by inserting foreign content from another image, as in a splicing forgery, or by locally manipulating a part of the image. Forging an image inherently introduces a localized inconsistency of the forensic traces in the image. We demonstrate that our proposed similarity system detects and localizes the forged region of an image by exposing that it has a different forensic trace than the rest of the image.


Fig. 5. Splicing detection and localization example: (a) original, (b) spliced, (c) host reference, (d) spliced reference, (e) host reference on the original image; (f) camera traces, 256×256; (g) camera traces, 128×128; (h) camera + JPEG traces, 128×128; (i) manipulation traces, 256×256. The green box outlines a reference patch. Patches, spanning the image with 50% overlap, that are detected as forensically different from the reference patch are highlighted in red. Image downloaded from www.reddit.com.

Fig. 6. Splicing detection and localization example: (a) spliced image, (b) host reference, 256×256, (c) host reference, 128×128. The green box outlines a reference patch. Patches, spanning the image with 50% overlap, that are detected as forensically different from the reference patch are highlighted in red.

Fig. 7. Manipulation detection and localization example: (a) original, (b) manipulated, (c) host reference, (d) host reference, 128×128. The green box outlines a reference patch. Patches, spanning the image with 50% overlap, that are detected as forensically different from the reference patch are highlighted in red.


Importantly, we show that our proposed system is effective on "in-the-wild" forged images, which are visually realistic and have been downloaded from a popular social media website.

We do this on three forged images that were downloaded from www.reddit.com, for which we also have access to the original versions. First, we subdivided each forged image into image patches with 50% overlap. Next, we selected one image patch as a reference patch and calculated the similarity score to all other image patches in the image. We used the similarity system trained in Sec. IV-A to determine whether two image patches were captured by the same or different camera model, with secondary JPEG compression. We then highlighted the image patches with similarity scores less than a threshold, i.e. those that contain a different forensic trace than the reference patch.
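A sketch of this localization procedure follows, with the trained similarity network abstracted as a similarity_score callable; the threshold value and all names are illustrative.

```python
# Compare a reference patch against every patch on a 50%-overlap grid and
# flag patches whose similarity score falls below a threshold.
import numpy as np

def localize(image, ref_xy, similarity_score, size=256, thresh=0.5):
    """Return grid coordinates of patches forensically different from the reference."""
    step = size // 2                                 # 50% overlap
    ry, rx = ref_xy
    ref = image[ry:ry + size, rx:rx + size]
    flagged = []
    for y in range(0, image.shape[0] - size + 1, step):
        for x in range(0, image.shape[1] - size + 1, step):
            patch = image[y:y + size, x:x + size]
            if similarity_score(ref, patch) < thresh:
                flagged.append((y, x))               # different forensic trace
    return flagged
```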

Results from this procedure on the first forged image are shown in Fig. 5. The original image is shown in Fig. 5a. The spliced version is shown in Fig. 5b, where an actor was spliced into the image. When we selected a reference patch from the host (original) part of the image, the image patches in the spliced region were highlighted as forensically different, as shown in Fig. 5c. We note that our forensic similarity based approach is agnostic to which regions are forged and which are authentic, just that they have different forensic signatures. This is seen in Fig. 5d, where we selected a spliced image patch as the reference patch. Fig. 5e shows that when we performed this analysis on the original image, our forensic similarity system did not find any patches forensically different from the reference patch.

The second row of Fig. 5 shows forensic similarity analysis using networks trained under different scenarios. Results using the network trained to determine whether two image patches have the same or different source camera model without JPEG post-compression are shown in Fig. 5f for patch size 256×256 and in Fig. 5g for patch size 128×128, and with JPEG post-compression in Fig. 5h for patch size 128×128. The result using the network trained to determine whether two image patches have been manipulated by the same or different manipulation type is shown in Fig. 5i.

Results from the splicing detection and localization procedure on a second forged image are shown in Fig. 6, where a set of toys was spliced into an image of a meeting of government officials. When we selected reference patches from the host image, the spliced areas were correctly identified as containing different forensic traces, exposing the image as forged. This is seen in Fig. 6b with 256×256 patches, and in Fig. 6c with 128×128 patches. The 128×128 case showed better localization of the spliced region and additionally identified the yellow airplanes as different from the host image, which were not identified by the similarity system using the larger patch size.

In a final example, shown in Fig. 7, the raindrop stains on a man's shirt were edited out of the image by a forger using a brush tool. When we selected a reference patch from the unedited part of the image, the manipulated regions were identified as forensically different, exposing the tampered region of the image. This is seen in Fig. 7c with 256×256 patches, and in Fig. 7d with 128×128 patches. In the 128×128 case, the smaller patch size was able to correctly expose that the man's shirt sleeve was also edited.


TABLE VII
RATES OF CORRECTLY IDENTIFYING A DATABASE AS "CONSISTENT" (I.E. ALL THE SAME CAMERA MODEL) OR "INCONSISTENT," M = 10, N = 20

Threshold   Type 0   Type 1   Type 2
0.1         97.4%    84.8%    100%
0.5         92.4%    91.9%    100%
0.9         70.7%    96.7%    100%

The results presented in this section show that our proposed forensic similarity based approach is a powerful technique that can be used to detect and localize image forgeries. These results show that correctly identifying that forensic differences exist within an image is sufficient to expose the forgery, even though the technique does not identify the particular forensic traces in the image.

B. Database Consistency Verification

In this section, we demonstrate that the forensic similarity system detects whether a database of images has been captured all by the same camera model or by different camera models. This is an important task for social media websites, where some accounts illicitly steal copyrighted content from many different sources. We refer to these accounts as "content aggregators," who upload images captured by many different camera models. This type of account contrasts with "content generator" accounts, who upload images captured by one camera model. In this experiment, we show how forensic similarity is used to differentiate between these two types of accounts.

To do this, we generated three types of image databases. Each database contained M images and was assigned one of three "Types." Type 0 databases contained M images taken by the same camera model, i.e. a content generator database. Type 1 databases contained M−1 images taken by one camera model, and 1 image taken by a different camera model. Finally, Type 2 databases contained M images, each taken by a different camera model. We consider the Type 1 case the hardest to differentiate from a Type 0 database, whereas the Type 2 case is the easiest to detect. We created 1000 databases of each type from images captured by camera models in set C, i.e. unknown camera models not used in training, with the camera models randomly chosen for each database.
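The database construction can be sketched as follows, assuming a hypothetical pool of images grouped by camera model (each pool large enough to sample from).

```python
# Illustrative construction of the three database types.
import random

def make_database(images_by_model, db_type, M=10):
    models = list(images_by_model)
    if db_type == 0:                       # all M images from one camera model
        m = random.choice(models)
        return random.sample(images_by_model[m], M)
    if db_type == 1:                       # M-1 from one model, 1 from another
        m, other = random.sample(models, 2)
        return (random.sample(images_by_model[m], M - 1)
                + [random.choice(images_by_model[other])])
    chosen = random.sample(models, M)      # Type 2: every image a different model
    return [random.choice(images_by_model[m]) for m in chosen]
```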

To classify a database as consistent or inconsistent, we examined each of the $\binom{M}{2}$ unique image pairings of the database. For each image pair, we randomly selected N 256×256 image patches from each image and calculated the $N^2$ similarity scores across the two images. Similarity was calculated using the similarity network trained in Sec. IV-A. Then, we calculated the median value of the scores for each whole-image pair. For image pairs captured by the same camera model this value is high, and for two images captured by different camera models this value is low. We then compare the $(M-2)$th lowest value calculated from the entire database to a threshold. If this $(M-2)$th lowest value is above the threshold, we consider the database to be consistent, i.e. from a content generator. If this value is below the threshold, then we consider the database to be inconsistent, i.e. from a content aggregator.
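A sketch of this decision rule follows; sample_patches and similarity_score are hypothetical stand-ins for the patch sampling step and the trained similarity network.

```python
# Consistency test: median similarity per image pair, then compare the
# (M-2)th lowest of those medians to a threshold.
from itertools import combinations
import numpy as np

def is_consistent(images, sample_patches, similarity_score, N=20, thresh=0.5):
    pair_medians = []
    for img_a, img_b in combinations(images, 2):     # all M-choose-2 image pairs
        pa, pb = sample_patches(img_a, N), sample_patches(img_b, N)
        scores = [similarity_score(x, y) for x in pa for y in pb]  # N^2 scores
        pair_medians.append(np.median(scores))
    M = len(images)
    kth_lowest = sorted(pair_medians)[M - 3]         # the (M-2)th lowest median
    return kth_lowest >= thresh                      # True -> content generator
```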



Table VII shows the rates at which we correctly classify Type 0 databases as "consistent" (i.e. all from the same camera model) and Type 1 and Type 2 databases as "inconsistent," with M = 10 images per database and N = 20 patches chosen from each image. For a threshold of 0.5, we correctly classified 92.4% of Type 0 databases as consistent, and incorrectly classified 8.1% of Type 1 databases as consistent. This incorrect classification rate for Type 1 databases decreases as the threshold is increased. Even at a very low threshold of 0.1, our system correctly identified all Type 2 databases as inconsistent.

The results of this experiment show that our proposed forensic similarity system is effective for verifying the consistency of an image database, an important type of practical forensic analysis. Since none of the images used in this experiment were from camera models used to train the forensic system, an identification-type approach would not have been appropriate. In this application it was not important to identify which camera models were used in a particular database, only whether the images came from the same or different camera models.

VI. CONCLUSION

In this paper we proposed a new digital image forensics technique, called forensic similarity, which determines whether two image patches contain the same or different forensic traces. The main benefit of this approach is that prior knowledge, e.g. training samples, of a forensic trace is not required to make a forensic similarity decision on it. To do this, we proposed a two-part deep-learning system composed of a CNN-based feature extractor and a three-layer neural network, called the similarity network, which maps pairs of image patches onto a score indicating whether they contain the same or different forensic traces. We experimentally evaluated the performance of our approach in three types of common forensic scenarios, which showed that our proposed system was accurate in a variety of settings. Importantly, the experiments showed that this system is accurate even on "unknown" forensic traces that were not used to train the system, and that our proposed system significantly improved upon the prior art in [19], reducing error rates by over 50%. Furthermore, we demonstrated the utility of the forensic similarity approach in two practical applications: forgery detection and localization, and image database consistency verification.

REFERENCES

[1] M. C. Stamm, M. Wu, and K. Liu, "Information forensics: An overview of the first decade," IEEE Access, vol. 1, pp. 167–200, 2013.

[2] J. Chen, X. Kang, Y. Liu, and Z. J. Wang, "Median filtering forensics based on convolutional neural networks," IEEE Signal Processing Letters, vol. 22, no. 11, pp. 1849–1853, 2015.

[3] B. Bayar and M. C. Stamm, "On the robustness of constrained convolutional neural networks to JPEG post-compression for image resampling detection," in Acoustics, Speech and Signal Processing (ICASSP), Int. Conference on. IEEE, 2017, pp. 2152–2156.

[4] J. Bunk, J. H. Bappy, T. M. Mohammed, L. Nataraj, A. Flenner, B. Manjunath, S. Chandrasekaran, A. K. Roy-Chowdhury, and L. Peterson, "Detection and localization of image forgeries using resampling features and deep learning," in Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, 2017, pp. 1881–1889.

[5] X. Zhu, Y. Qian, X. Zhao, B. Sun, and Y. Sun, "A deep learning approach to patch-based image inpainting forensics," Signal Processing: Image Communication, 2018.

[6] B. Bayar and M. C. Stamm, "A deep learning approach to universal image manipulation detection using a new convolutional layer," in Workshop on Info. Hiding and Multimedia Sec. ACM, 2016, pp. 5–10.

[7] ——, "Constrained convolutional neural networks: A new approach towards general purpose image manipulation detection," IEEE Transactions on Information Forensics and Security, 2018.

[8] M. Boroumand and J. Fridrich, "Deep learning for detecting processing history of images," Electronic Imaging, vol. 2018, no. 7, pp. 1–9, 2018.

[9] B. Bayar and M. C. Stamm, "Towards order of processing operations detection in JPEG-compressed images with convolutional neural networks," Electronic Imaging, vol. 2018, no. 7, pp. 1–9, 2018.

[10] M. Barni, L. Bondi, N. Bonettini, P. Bestagini, A. Costanzo, M. Maggini, B. Tondi, and S. Tubaro, "Aligned and non-aligned double JPEG detection using convolutional neural networks," Journal of Visual Communication and Image Representation, vol. 49, pp. 153–163, 2017.

[11] I. Amerini, T. Uricchio, L. Ballan, and R. Caldelli, "Localization of JPEG double compression through multi-domain convolutional neural networks," in IEEE CVPR Workshop on Media Forensics, vol. 3, 2017.

[12] A. Tuama, F. Comby, and M. Chaumont, "Camera model identification with the use of deep convolutional neural networks," in Information Forensics and Security (WIFS), Workshop on. IEEE, 2016, pp. 1–6.

[13] L. Bondi, D. Güera, L. Baroffio, P. Bestagini, E. J. Delp, and S. Tubaro, "A preliminary study on convolutional neural networks for camera model identification," Electronic Imaging, vol. 2017, no. 7, pp. 67–76, 2017.

[14] L. Bondi, L. Baroffio, D. Güera, P. Bestagini, E. J. Delp, and S. Tubaro, "First steps toward camera model identification with convolutional neural networks," IEEE Signal Processing Letters, pp. 259–263, 2017.

[15] B. Bayar and M. C. Stamm, "Augmented convolutional feature maps for robust CNN-based camera model identification," in Image Processing (ICIP), Int. Conference on. IEEE, 2017, pp. 1–4.

[16] I. Amerini, T. Uricchio, and R. Caldelli, "Tracing images back to their social network of origin: A CNN-based approach," in Information Forensics and Security (WIFS), Workshop on. IEEE, 2017, pp. 1–5.

[17] B. Bayar and M. C. Stamm, "Towards open set camera model identification using a deep learning framework," in Acoustics, Speech and Signal Processing (ICASSP), Int. Conference on. IEEE, 2018, pp. 1–4.

[18] L. Bondi, S. Lameri, D. Güera, P. Bestagini, E. J. Delp, and S. Tubaro, "Tampering detection and localization through clustering of camera-based CNN features," in Computer Vision and Pattern Recognition Workshops. IEEE, 2017, pp. 1855–1864.

[19] O. Mayer and M. C. Stamm, "Learned forensic source similarity for unknown camera models," in Acoustics, Speech and Signal Processing (ICASSP), Int. Conference on. IEEE, 2018, pp. 1–4.

[20] O. Mayer, B. Bayar, and M. C. Stamm, "Learning unified deep-features for multiple forensic tasks," in Proceedings of the 6th ACM Workshop on Information Hiding and Multimedia Security. ACM, 2018, pp. 1–6.

[21] B. Bayar and M. C. Stamm, "A generic approach towards image manipulation parameter estimation using convolutional neural networks," in Proceedings of the 5th ACM Workshop on Information Hiding and Multimedia Security. ACM, 2017, pp. 147–157.

[22] J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. Zhang, and J. Li, "Deep learning for content-based image retrieval: A comprehensive study," in International Conference on Multimedia. ACM, 2014, pp. 157–166.

[23] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, "End-to-end text-dependent speaker verification," in Acoustics, Speech and Signal Processing (ICASSP), Int. Conference on. IEEE, 2016, pp. 5115–5119.

[24] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "DeCAF: A deep convolutional activation feature for generic visual recognition," in International Conference on Machine Learning, 2014, pp. 647–655.

[25] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, "Learning deep features for scene recognition using places database," in Advances in Neural Information Processing Systems, 2014, pp. 487–495.

[26] O. A. Penatti, K. Nogueira, and J. A. dos Santos, "Do deep features generalize from everyday objects to remote sensing and aerial scenes domains?" in Proc. of the IEEE CVPR Workshops, 2015, pp. 44–51.

[27] B. Bayar and M. C. Stamm, "Design principles of convolutional neural networks for multimedia forensics," Electronic Imaging, pp. 77–86, 2017.

[28] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, "Signature verification using a "siamese" time delay neural network," in Advances in Neural Information Processing Systems, 1994, pp. 737–744.

[29] H.-S. Lee, Y. Tso, Y.-F. Chang, H.-M. Wang, and S.-K. Jeng, "Speaker verification using kernel-based binary classifiers with binary operation derived features," in Acoustics, Speech and Signal Processing (ICASSP), Int. Conference on. IEEE, 2014, pp. 1660–1664.

[30] T. Gloe and R. Böhme, "The 'Dresden Image Database' for benchmarking digital image forensics," in Proceedings of the 2010 ACM Symposium on Applied Computing. ACM, 2010, pp. 1584–1590.

[31] D. Güera, S. K. Yarlagadda, P. Bestagini, F. Zhu, S. Tubaro, and E. J. Delp, "Reliability map estimation for CNN-based camera model attribution," arXiv preprint arXiv:1805.01946, 2018.