
This paper is included in the Proceedings of the 14th USENIX Conference on File and Storage Technologies (FAST ’16).
February 22–25, 2016 • Santa Clara, CA, USA

ISBN 978-1-931971-28-7

Open access to the Proceedings of the 14th USENIX Conference on File and Storage Technologies is sponsored by USENIX.

Estimating Unseen Deduplication – from Theory to Practice

Danny Harnik, Ety Khaitzin, and Dmitry Sotnikov, IBM Research–Haifa

https://www.usenix.org/conference/fast16/technical-sessions/presentation/harnik


Estimating Unseen Deduplication – from Theory to Practice

Danny Harnik, Ety Khaitzin and Dmitry Sotnikov
IBM Research–Haifa

{dannyh, etyk, dmitrys}@il.ibm.com

Abstract

Estimating the deduplication ratio of a very large dataset is extremely useful, but genuinely hard to perform. In this work we present a new method for accurately estimating deduplication benefits that runs 3X to 15X faster than the state of the art to date. The level of improvement depends on the data itself and on the storage media that it resides on. The technique is based on breakthrough theoretical work by Valiant and Valiant from 2011, which gives a provably accurate method for estimating various measures while seeing only a fraction of the data. However, for the use case of deduplication estimation, putting this theory into practice runs into significant obstacles. In this work, we find solutions and novel techniques to enable the use of this new and exciting approach. Our contributions include a novel approach for gauging the estimation accuracy, techniques to run the estimation with low memory consumption, a method to evaluate the combined compression and deduplication ratio, and ways to perform the actual sampling in real storage systems in order to actually reap benefits from these algorithms. We evaluated our work on a number of real world datasets.

1 Introduction

1.1 Deduplication and Estimation

After years of flourishing in the world of backups, deduplication has taken center stage and is now positioned as a key technology for primary storage. With the rise of all-flash storage systems that have both higher cost and much better random read performance than rotating disks, deduplication, and data reduction in general, makes more sense than ever. Combined with the popularity of modern virtual environments and their high repetitiveness, consolidating duplicate data reaps very large benefits for such high-end storage systems. This trend is bound to continue with new storage class memories looming, which are expected to have even better random access and higher cost per GB than flash.

This paper is about an important yet extremely hard question – how to estimate the deduplication benefits of a given dataset? Potential customers need this information in order to make informed decisions on whether high-end storage with deduplication is worthwhile for them. Even more so, the question of sizing and capacity planning is deeply tied to the deduplication effectiveness expected on the specific data. Indeed, all vendors of deduplication solutions have faced this question, and unfortunately there are no easy solutions.

1.2 The Hardness of Deduplication Estimation and State of the Art

The difficulty stems from the fact that deduplication is a global property and as such requires searching across large amounts of data. In fact, there are theoretical proofs that this problem is hard [12] and, more precisely, that in order to get an accurate estimation one is required to read a large fraction of the data from disk. In contrast, compression is a local procedure and therefore the compression estimation problem can be solved very efficiently [11].

As a result, the existing solutions in the market take one of two approaches. The first is simply to give an educated guess based on prior knowledge and on information about the workload at hand. For example, Virtual Desktop Infrastructure (VDI) environments were reported (e.g. [1]) to give an average 0.16 deduplication ratio (a 1:6 reduction). In reality, however, depending on the specific environment, the results can vary all the way between a 0.5 and a 0.02 deduplication ratio (between 1:2 and 1:50). As such, using such a vague estimate for sizing is highly inaccurate.

The other approach is a full scan of the data at hand. In practice, a typical user runs a full scan on as much data as possible and gets an accurate estimation, but only for the data that was scanned. This method is not without challenges, since evaluating the deduplication ratio of a scanned dataset requires a large amount of memory and disk operations, typically much more than would be allowed for an estimation scan. As a result, research on the topic [12, 19] has focused on getting accurate estimations with low memory requirements, while still reading all data from disk (and computing hashes on all of the data).

In this work we study the ability to estimate deduplication while not reading the entire dataset.

1.3 Distinct Elements Counting

The problem of estimating deduplication has surfaced in the past few years with the popularity of the technology. However, this problem is directly linked to a long standing problem in computer science, that of estimating the number of distinct elements in a large set or population. With motivations ranging from biology (estimating the number of species) to databases (distinct values in a table/column), the problem has received much attention. There is a long list of heuristic statistical estimators (e.g. [4, 9, 10]), but these do not have tight guarantees on their accuracy and mostly target sets that have a relatively low number of distinct elements. Their empirical tests perform very poorly on distributions with a long tail (distributions in which a large fraction of the data has low duplication counts), which is the common case with deduplication. Figure 1 shows examples of how inaccurate heuristic estimations can be on a real dataset. It also shows how far the deduplication ratio of the sample can be from that of the entire dataset.

Figure 1: An example of the failure of current sampling approaches. Each graph depicts a deduplication ratio estimate as a function of the sampling percent. The points are an average over 100 random samples of the same percent. We see that simply looking at the ratio on the sample gives a very pessimistic estimation, while the estimator from [4] is always too optimistic in this example.

This empirical difficulty is also supported by theoretical lower bounds [16] proving that an accurate estimation would require scanning a large fraction of the data. As a result, a large bulk of the work on distinct elements focused on low-memory estimation on data streams (including a long list of studies starting from [7] and culminating in [14]). These estimation methods require a full scan of the data and form the foundation for the low-memory scans for deduplication estimation mentioned in the previous section.

In 2011, in a breakthrough paper, Valiant and Valiant [17] showed that at least Ω(n / log n) of the data must be inspected and, moreover, that there is a matching upper bound. Namely, they showed a theoretical algorithm that achieves provable accuracy if at least O(n / log n) of the elements are examined. Subsequently, a variation on this algorithm was also implemented by the same authors [18]. Note that the Valiants' work, titled "Estimating the unseen", is more general than just distinct elements estimation and can be used to estimate other measures such as the entropy of the data (which was the focus of the second paper [18]).

This new "Unseen" algorithm is the starting point of our work, in which we attempt to deploy it for deduplication estimation.

1.4 Our Work

More often than not, moving between theory and practice is not straightforward, and this was definitely the case for estimating deduplication. In this work we tackle many challenges that arise when trying to successfully employ this new technique in a practical setting. For starters, it is not clear that performing random sampling at a small granularity has much benefit over a full sequential scan in an HDD based system. But there are a number of deeper issues that need to be tackled in order to actually benefit from the new approach. Following is an overview of the main topics and our solutions:

Understanding the estimation accuracy. The proofs of accuracy of the Unseen algorithm are theoretic and asymptotic in nature and simply do not translate to concrete real world numbers. Moreover, they provide a worst case analysis and do not give any guarantee for datasets that are easier to analyze. So there is no real way to know how much data to sample and what fraction is actually sufficient. In this work we present a novel method to gauge the accuracy of the algorithm. Rather than return an estimation, our technique outputs a range in which the actual result is expected to lie. This is practical in many ways, and specifically allows for a gradual execution: first take a small sample and evaluate its results, and if the range is too large, then continue by increasing the sample size. While our tests indicate that a 15% sample is sufficient for a good estimation on all workloads, some real life workloads reach a good enough estimation with a sample as small as 3% or even less. Using our method, one can stop early when reaching a sufficiently tight estimation.

The memory consumption of the algorithm. In real systems, being able to perform the estimation with a small memory footprint and without additional disk IOs is highly desirable, and in some cases a must. The problem is that simply running the Unseen algorithm as prescribed requires mapping and counting all of the distinct chunks in the sample. In the use case of deduplication this number can be extremely large, on the same order of magnitude as the number of chunks in the entire dataset. Note that low memory usage also results in lower communication bandwidth when the estimation is performed in a distributed system.

Existing solutions for low-memory estimation of distinct elements cannot be combined in a straightforward manner with the Unseen algorithm. Rather, they require some modifications and a careful combination. We present two such approaches: one with tighter accuracy, and a second that loses more estimation tightness but is better for distributed and adaptive settings. In both cases the overall algorithm can run with as little as 10MBs of memory.

Combining deduplication and compression. Many of the systems deploying deduplication also use compression to supplement the data reduction. It was shown in multiple works that this combination is very useful in order to achieve improved data reduction for all workloads [5, 13, 15]. A natural approach to estimating the combined benefits is to estimate each one separately and multiply the two ratios. However, this will only produce correct results when the deduplication effectiveness and compression effectiveness are independent, which in some cases is not true. We present a method to integrate compression into the Unseen algorithm and show that it yields accurate results.

How to perform the sampling? As stated above, performing straightforward sampling at a small granularity (e.g. 4KB) is extremely costly in HDD based systems (in some scenarios, sampling as little as 2% may already take more than a full scan). Instead we resort to sampling at large "super-chunks" (of 1MB) and performing reads in a sorted fashion. Such sampling runs significantly faster than a full scan and this is the main source of our time gains.

Equally important, we show that our methods can be tuned to give correct and reliable estimations under this restriction (at the cost of a slightly looser estimation range). We also suggest an overall sampling strategy that requires low memory, produces sorted non-repeating reads and can be run in a gradual fashion (e.g. if we want to first read a 5% sample and then enlarge the sample to a 10% one).

Summary of our results. In summary, we design and evaluate a new method for estimating deduplication ratios in large datasets. Our strategy utilizes less than 10MBs of RAM space, can be distributed, and can accurately estimate the joint benefits of compression and deduplication (as well as their separate benefits). The resulting estimation is presented as a range in which the actual ratio is expected to reside (rather than a single number). This allows for a gradual mode of estimation, where one can sample a small fraction, evaluate it and continue sampling if the resulting range is too loose. Note that the execution time of the actual estimation algorithms is negligible vs. the time that it takes to scan the data, so being able to stop with a small sampling fraction is paramount to achieving an overall time improvement.

We evaluate the method on a number of real life workloads and validate its high accuracy. Overall, our method achieves at least a 3X time improvement over the state of the art scans. The time improvement varies according to the data and the medium on which the data is stored, and can reach time improvements of 15X and more.

2 Background and the Core Algorithm

2.1 Preliminaries and Background

Deduplication is performed on data chunks whose size depends on the system at hand. In this paper we consider fixed size chunks of 4KB, a popular choice since it matches the underlying page size in many environments. However, the results can be easily generalized to different chunking sizes and methods. Note that variable-sized chunking can be handled in our framework but adds complexity, especially with respect to the actual sampling of chunks.

As is customary in deduplication, we represent the data chunks by a hash value of the data (we use the SHA1 hash function). Deduplication occurs when two chunks have the same hash value.

Denote the dataset at hand by S and view it as consisting of N data chunks (namely N hash values). The dataset is made up of D distinct chunks, where the i-th element appears n_i times in the dataset. This means that ∑_{i=1}^{D} n_i = N.

Our ultimate target is to come up with an estimation of the value D, or equivalently of the ratio r = D/N. Note that throughout the paper we use the convention where data reduction (deduplication or compression) is a ratio in [0, 1] where lower is better. Namely, a ratio of 1.0 means no reduction at all and 0.03 means that the data is reduced to 3% of its original size (a 97% saving).

When discussing the sampling, we will consider a sample of size K out of the entire dataset of N chunks. The corresponding sampling rate is denoted by p = K/N (for brevity, we usually present p as a percentage rather than a fraction). We denote by S_p the random sample of fraction p from S.

A key concept for this work is what we term a Duplication Frequency Histogram (DFH), defined next (note that in [17] this was termed the "fingerprint" of the dataset).

Definition 1. A Duplication Frequency Histogram (DFH) of a dataset S is a histogram x = {x_1, x_2, ...} in which the value x_i is the number of distinct chunks that appeared exactly i times in S.


For example, the DFH of a dataset consisting of N distinct elements will have x_1 = N and zero elsewhere. An equal sized dataset where all elements appear exactly twice will have a DFH with x_2 = N/2 and zero elsewhere. Note that for a legal DFH it must hold that ∑_i x_i · i = N and moreover that ∑_i x_i = D. The length of a DFH is set by the highest non-zero x_i; in other words, it is the frequency of the most popular chunk in the dataset. The same definition of DFH holds also when discussing a sample rather than the entire dataset.
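For illustration, a DFH can be computed from a list of chunk hashes with two nested frequency counts; the following minimal Python snippet (illustrative only, with names of our choosing) does exactly that.

from collections import Counter

def dfh(hashes):
    # n per distinct chunk, then x_i = number of distinct chunks with count i
    per_chunk = Counter(hashes)
    per_freq = Counter(per_chunk.values())
    length = max(per_freq, default=0)      # frequency of the most popular chunk
    return [per_freq.get(i, 0) for i in range(1, length + 1)]

# Example: dfh(['a', 'b', 'b', 'c', 'c', 'c']) returns [1, 1, 1];
# indeed sum_i x_i * i = 1 + 2 + 3 = 6 = N and sum_i x_i = 3 = D.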

The approach of the Unseen algorithm is to estimate the DFH of a dataset and from it devise an estimation of the deduplication.

2.2 The Unseen Algorithm

In this section we give a high level presentation of the core Unseen algorithm. The input of the algorithm is a DFH y of the observed sample S_p, and from it the algorithm finds an estimation x̂ of the DFH of the entire dataset S. At a high level, the algorithm finds a DFH x̂ on the full set that serves as the "best explanation" of the observed DFH y on the sample.

As a preliminary step, for a DFH x′ on the dataset S, define the expected DFH y′ on a random p-sample of S. In the expected DFH each entry is exactly the statistical expectation of this value in a random p-sample. Namely, y′_i is the expected number of chunks that appear exactly i times in a random p fraction sample. For a fixed p this expected DFH can be computed from x′ via a linear transformation and can be represented by a matrix A_p such that y′ = A_p · x′.

The main idea in the Unseen algorithm is to find an x′ that minimizes a distance measure between the expected DFH y′ and the observed DFH y. The distance measure used is a normalized L1 norm (normalized by the values of the observed y). We use the following notation for the exact measure being used:

Δ_p(x′, y) = ∑_i (1 / √(y_i + 1)) · |y_i − (A_p · x′)_i| .

The algorithm uses linear programming for the minimization and is outlined in Algorithm 1.

Algorithm 1: Unseen Core
Input: Sample DFH y, fraction p, total size N
Output: Estimated deduplication ratio r̂

/* Prepare expected DFH transformation */
A_p ← prepareExpectedA(p);

Linear program:
    Find x′ that minimizes: Δ_p(x′, y)
    Under constraints: /* x′ is a legal DFH */
    ∑_i x′_i · i = N and ∀i, x′_i ≥ 0

return r̂ = (∑_i x′_i) / N

The actual algorithm is a bit more complicated due to two main issues: 1) This methodology is suited for estimating the duplication frequencies of unpopular chunks. The very popular chunks can be estimated in a straightforward manner (an element with a high count c is expected to appear approximately p · c times in the sample). So the DFH is first broken into an easy part for straightforward estimation and a hard part for estimation via Algorithm 1. 2) Solving a linear program with too many variables is impractical, so instead of solving for a full DFH x, a sparser mesh of values is used (meaning that not all duplication values are allowed in x). This relaxation is acceptable since this level of inaccuracy has very little influence for high frequency counts. It is also crucial for making the running time of the LP low and basically negligible with respect to the scan time.

The matrix A_p is computed via a combination of binomial probabilities. The calculation changes significantly if the sampling is done with repetition (as was used in [18]) vs. without repetition. We refer the reader to [18] for more details on the core algorithm.
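For illustration, the following minimal Python sketch solves the linear program of Algorithm 1 with an off-the-shelf LP solver. It is deliberately simplified: it assumes sampling with repetition (so that A_p is binomial, as in [18]), solves for a dense x′ up to a fixed maximum frequency instead of the sparse mesh and easy/hard split described above, and encodes the weighted L1 objective with slack variables t_i ≥ |y_i − (A_p x′)_i|. Our evaluation code (Section 4) is written in Matlab and differs in these details.

import numpy as np
from scipy.stats import binom
from scipy.optimize import linprog

def prepare_expected_A(p, max_sample_freq, max_full_freq):
    # A_p[i-1, j-1] = Pr[a chunk with j copies overall shows up i times
    # in a p-fraction sample], under sampling with repetition (binomial).
    i = np.arange(1, max_sample_freq + 1)[:, None]
    j = np.arange(1, max_full_freq + 1)[None, :]
    return binom.pmf(i, j, p)

def unseen_core(y, p, N, max_full_freq=1000):
    # y[i-1] = number of distinct chunks seen exactly i times in the sample.
    y = np.asarray(y, dtype=float)
    m, n = len(y), max_full_freq
    A = prepare_expected_A(p, m, n)
    w = 1.0 / np.sqrt(y + 1.0)                       # weights of the L1 norm

    # Variables are [x (n entries), t (m entries)] with t_i >= |y_i - (A x)_i|.
    c = np.concatenate([np.zeros(n), w])             # minimize sum_i w_i * t_i
    A_ub = np.block([[A, -np.eye(m)],                #  A x - t <= y
                     [-A, -np.eye(m)]])              # -A x - t <= -y
    b_ub = np.concatenate([y, -y])
    A_eq = np.concatenate([np.arange(1, n + 1.0), np.zeros(m)])[None, :]
    b_eq = [N]                                       # legal DFH: sum_j x_j * j = N
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (n + m), method="highs")
    return res.x[:n].sum() / N                       # r_hat = (sum_i x_i) / N

In the full algorithm the LP is solved only for the hard part of the DFH and over a sparse mesh of frequencies, which keeps its running time negligible relative to the scan.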

3 From Theory to Practice

In this section we present our work on actually deploying the Unseen estimation method for real world deduplication estimation. Throughout the section we demonstrate the validity of our results using tests on a single dataset. This is done for clarity of exposition, and the dataset serves only as a representative of the results, which were tested across all our workloads. The dataset is the Linux Hypervisor data (see Table 1 in Section 4) that was also used in Figure 1. The entire scope of results on all workloads appears in the evaluation section (Section 4).

3.1 Gauging the Accuracy

We tested the core Unseen algorithm on real life workloads and it has impressive results; in general it thoroughly outperforms some of the estimators in the literature. The overall impression is that a 15% sample is sufficient for accurate estimation. On the other hand, the accuracy level varies greatly from one workload to the next, and often the estimation obtained from 5% or even less is sufficient for all practical purposes. See the example in Figure 2.

Figure 2: The figure depicts sampling trials at each of the following sample percentages: 1%, 2%, 5%, 7.5%, 10%, 15% and 20%. For each percentage we show the results of Unseen on 30 independent samples. The results are very noisy at first; for example, on the 1% samples they range from ∼0.17 all the way to ∼0.67. But we see a nice convergence starting at 7.5%. The problem is that seeing just a single sample, and not 30, gives little indication about the accuracy and convergence.

So the question remains: how to interpret the estimation result, and when have we sampled enough? To address these questions we devise a new approach that returns a range of plausible deduplication ratios rather than a single estimation number. In a nutshell, the idea is that rather than give the DFH that is the "best explanation" of the observed y, we test all the DFHs that are a "reasonable explanation" of y and identify the range of possible duplication in these plausible DFHs.

Technically, we define criteria for all plausible solutions x′ that can explain an observed sample DFH y. Of all these plausible solutions we find the ones which give the minimal and maximal deduplication ratios. In practice, we add two additional linear programs to the initial optimization. The first linear program helps in identifying the neighborhood of plausible solutions. The second and third linear programs find the two limits of the plausible range. In these linear programs we replace the optimization on the distance measure with an optimization on the number of distinct chunks. The method is outlined in Algorithm 2.

Note that in [18] there is also a use of a second linear program for entropy estimation, but it is done for a different purpose (implementing a sort of Occam's razor).

Algorithm 2: Unseen Range
Input: Sample DFH y, fraction p, total size N, slackness α (default α = 0.5)
Output: Deduplication ratio range [r_low, r_high]

/* Prepare expected DFH transformation */
A_p ← prepareExpectedA(p);

1st Linear program:
    Find x′ that minimizes: Δ_p(x′, y)
    Under constraints: /* x′ is a legal DFH */
    ∑_i x′_i · i = N and ∀i, x′_i ≥ 0
For the resulting x′ compute: Opt = Δ_p(x′, y)

2nd Linear program:
    Find x that minimizes: ∑_i x_i
    Under constraints:
    ∑_i x_i · i = N and ∀i, x_i ≥ 0 and Δ_p(x, y) < Opt + α√Opt

3rd Linear program:
    Find x that maximizes: ∑_i x_i
    Under constraints:
    ∑_i x_i · i = N and ∀i, x_i ≥ 0 and Δ_p(x, y) < Opt + α√Opt

return [r_low = (∑_i x_i)/N for the minimizing x, r_high = (∑_i x_i)/N for the maximizing x]

Why does it work? We next describe the intuition behind our solution. Consider the distribution of Δ_p(x, y) for a fixed x and under a random y (a random y meaning a DFH of a randomly chosen p-sample). Suppose that we knew the expectation E(Δ_p) and standard deviation σ of Δ_p. Then, given an observed y, we expect, with very high probability, that the only plausible source DFHs x′ are such that Δ_p(x′, y) is close to E(Δ_p) (within α · σ for some slackness variable α). This set of plausible x′ can be fairly large, but all we really care to learn about it is its boundaries in terms of deduplication ratio. The second and third linear programs find, out of this set of plausible DFHs, the ones with the best and worst deduplication ratios.

The problem is that we do not know how to cleanly compute the expectation and standard deviation of Δ_p, so we use the first linear program to give us a single value within the plausible range. We use this result to estimate the expectation and standard deviation and to give bounds on the range of plausible DFHs.
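The second and third linear programs can be sketched as a small extension of the previous snippet. Again this is a simplified illustration: it reuses prepare_expected_A from the Algorithm 1 sketch, takes the first LP's optimum Opt as an input, and encodes the constraint Δ_p(x, y) ≤ Opt + α√Opt with the same slack variables.

import numpy as np
from scipy.optimize import linprog

def unseen_range(y, p, N, opt, alpha=0.5, max_full_freq=1000):
    y = np.asarray(y, dtype=float)
    m, n = len(y), max_full_freq
    A = prepare_expected_A(p, m, n)                  # from the Algorithm 1 sketch
    w = 1.0 / np.sqrt(y + 1.0)

    # Variables [x (n), t (m)] with t_i >= |y_i - (A x)_i|, plus the plausibility
    # constraint sum_i w_i * t_i <= Opt + alpha * sqrt(Opt).
    A_ub = np.block([[A, -np.eye(m)],
                     [-A, -np.eye(m)],
                     [np.zeros((1, n)), w[None, :]]])
    b_ub = np.concatenate([y, -y, [opt + alpha * np.sqrt(opt)]])
    A_eq = np.concatenate([np.arange(1, n + 1.0), np.zeros(m)])[None, :]
    b_eq, bounds = [N], [(0, None)] * (n + m)

    c = np.concatenate([np.ones(n), np.zeros(m)])    # objective: sum_i x_i
    lo = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                 bounds=bounds, method="highs")      # minimal #distinct -> r_low
    hi = linprog(-c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                 bounds=bounds, method="highs")      # maximal #distinct -> r_high
    return lo.x[:n].sum() / N, hi.x[:n].sum() / N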

Setting the slackness parameter. The main tool that we have for fine tuning the set of plausible solutions is the slackness parameter α. A small α will result in a tighter estimation range, yet risks having the actual ratio fall outside of the range. Our choice of slackness parameter is heuristic and tailored to the desired level of confidence. The choices of this parameter throughout the paper are made by thorough testing across all of our datasets and the various sample sizes. Our evaluations show that a slackness of α = 0.5 is sufficient, and one can choose a slightly larger number to play it safe. A possible approach is to use two different levels of slackness and present a "likely range" along with a "safe range".

In Figure 3 we see an evaluation of the range method. We see that our upper and lower bounds give an excellent estimation of the range of possible results that the plain Unseen algorithm would have produced on random samples of the given fraction.

Figure 3: This figure depicts the same test as in Figure 2, but adds the range algorithm results. This gives a much clearer view – for example, one can deduce that the deduplication ratio is better than 50% already at a 5% sample, that the range has converged significantly at this point, and that it is very tight already at 10%.

An interesting note is that, unlike many statistical estimations in which the actual result has a high likelihood of being at the center of the range, in our case all values in the range can be equally likely.

Evaluating methods via average range tightness. The range approach is also handy in comparing the success of various techniques. We can evaluate the average range sizes of two different methods and choose the one that gives a tighter average range size using the same sample percentage. For example, we compare running the algorithm when the sampling is with repetitions (this was the approach taken in [18]) versus taking the sample without repetitions. As mentioned in Section 2.2, this entails a different computation of the matrix A_p. Not surprisingly, taking samples without repetitions is more successful, as seen in Figure 4. This is intuitive, since repetitions reduce the amount of information collected in a sample. Note that sampling without repetition is conceptually simpler since it can be done in a totally stateless manner (see Section 3.4). Since the results in Figure 4 were consistent with the other workloads, we focus our attention from here on solely on the no-repetition paradigm.

Figure 4: An average range size comparison of sampling with repetitions vs. without repetitions. For each sampling percentage the averages are computed over 100 samples. Other than the 1% sample, which is bad to begin with, the no-repetition sampling is consistently better.

3.2 Running with Low Memory

Running the estimation with a 15% sample requires the creation of a DFH for the entire sample, which in turn requires keeping tabs on the duplication frequencies of all distinct elements in the sample. Much like the case of full scans, this quickly becomes an obstacle to actually deploying the algorithm. For example, in our largest test dataset, that would mean keeping tabs on approximately 200 million distinct chunks, which under very strict assumptions would require on the order of 10GBs of RAM, unless one is willing to settle for slow disk IOs instead of RAM operations. Moreover, in a distributed setting it would require moving GBs of data between nodes. Such high resource consumption may be feasible in a dedicated system, but not for actually determining deduplication ratios in the field, possibly at a customer's site and on the customer's own servers.

We present two approaches to handle this issue, both allowing the algorithm to run with as little as 10MBs of RAM. The first achieves relatively tight estimations (comparable to the high memory algorithms). The second produces somewhat looser estimations but is more flexible for use in distributed or dynamic settings.

The base sample approach. This approach follows the low-memory technique of [12] and uses it to estimate the sample's DFH with low memory. In this method we add an additional base step, so the process is as follows:

1. Base sample: Sample C chunks from the dataset (C is a "constant" – a relatively small number, independent of the dataset size). Note that we allow the same hash value to appear more than once in the base sample.

2. Sample and maintain low-memory chunk histogram: Sample a p fraction of the chunks and iterate over all the chunks in the sample. Record a histogram (duplication counts) for all the chunks in the base sample (and ignore the rest). Denote by c_j the duplication count of the j-th chunk in the base sample (j ∈ {1, ..., C}).


3. Extrapolate DFH: Generate an estimated DFH for the sample as follows:

∀i :  y_i = (|{j : c_j = i}| / i) · (pN / C).

In words, use the number of chunks in the base sample that had count i, extrapolated to the entire sample.

The crux is that the low-memory chunk histogram can produce a good approximation of the DFH. This is because the base sample is representative of the distribution of chunks in the entire dataset. In our tests we used a base sample of size C = 50,000, which amounts to less than 10MBs of memory consumption. The method does, however, add another estimation step to the process, and this adds noise to the overall result. To cope with it we need to increase the slackness parameter in the Range Unseen algorithm (from α = 0.5 to α = 2.5). As a result, the estimation range suffers a slight increase, but the overall performance is still very good, as seen in Figure 5.
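A minimal sketch of the base-sample bookkeeping and the extrapolation of step 3, assuming base holds the C hashes drawn by position from the full dataset (so repeated hash values are allowed) and sample_hashes is the p-fraction sample; the names are illustrative.

from collections import Counter

def base_sample_dfh(base, sample_hashes, p, N):
    C = len(base)
    base_set = set(base)
    # Duplication counts within the p-sample, tracked only for base chunks.
    counts = Counter(h for h in sample_hashes if h in base_set)
    c = [counts[h] for h in base if counts[h] > 0]     # c_j per base position j
    max_i = max(c, default=0)
    # y_i = (|{j : c_j = i}| / i) * (pN / C), extrapolated to the whole sample.
    return [(sum(1 for cj in c if cj == i) / i) * (p * N / C)
            for i in range(1, max_i + 1)]

The memory footprint is governed by C (50,000 in our tests) rather than by the number of distinct chunks in the sample.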

Figure 5: The graph depicts the average upper and lower bounds that are given by the algorithm when using the base sample method. Each point is the average over 100 different samples, and the error bars depict the maximum and minimum values over the 100 tests.

The only shortcoming of this approach is that the dataset to be studied needs to be set in advance; otherwise the base sample will not cover all of it. In terms of distribution and parallel execution, the base sample stage needs to be finished and finalized before running the actual sampling phase, which is the predominant part of the work (this main phase can then be easily parallelized). To overcome this we present a second approach that is more dynamic and amenable to parallelism, yet less tight.

A streaming approach. This method uses techniques from streaming algorithms geared towards distinct elements evaluation with low memory. In order to mesh with the Unseen method, the basic technique needs to be slightly modified to collect frequency counts that were otherwise redundant. The core principles, however, remain the same: a small (constant size) sample of distinct chunks is taken uniformly over the distinct chunks in the sample. Note that such a sampling disregards the popularity of a specific hash value, and so the most popular chunks will likely not be part of the sample. As a result, this method cannot estimate the sample DFH correctly, but rather takes a different approach. Distinct chunk sampling can be done using several techniques (e.g. [2, 8]). We use here the technique of [2], where only the C chunks that have the highest hash values (when ordered lexicographically) are considered in the sample. The algorithm is then as follows:

1. Sample and maintain low-memory chunk histogram: Sample a p fraction of the chunks and iterate over all the chunks in the sample. Maintain a histogram only of chunks that have one of the C highest hash values:

• If the hash is in the top C, increase its counter.

• If it is smaller than all the C hashes currently in the histogram, then ignore it.

• Otherwise, add it to the histogram and discard the lowest of the current C hashes.

Denote by δ the fraction of the hash domain that is covered by the C samples. Namely, if the hashes are calibrated to be numbers in the range [0, 1] then δ is the distance between 1 and the lowest hash in the top-C histogram.

2. Run Unseen: Generate a DFH solely of the C top hashes and run the Range Unseen algorithm. But rather than output ratios, output a range estimation on the number of distinct chunks. Denote this output range by [d_low, d_high].

3. Extrapolation to full sample: Output the estimation range [r_low = d_low / (δ·N), r_high = d_high / (δ·N)].

Unlike the base sample method, the streaming approach does not attempt to estimate the DFH of the p-sample. Instead, it uses an exact DFH of a small δ fraction of the hash domain. The Unseen algorithm then serves as a means of estimating the actual number of distinct hashes in this δ-sized portion of the hash domain. The result is then extrapolated from the number of distinct chunks in a small hash domain to the number of hashes in the entire domain. This relies on the fact that hashes should be evenly distributed over the entire range, so a δ fraction of the domain should hold approximately a δ portion of the distinct hashes.
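The top-C bookkeeping of step 1 can be sketched as follows, in the spirit of the distinct-sampling technique of [2]; the sketch assumes hashes are calibrated to floats in [0, 1) and the names are illustrative.

import heapq
from collections import defaultdict

def top_c_histogram(sample_hashes, C):
    counts = defaultdict(int)     # duplication counts of the kept (top-C) hashes
    heap = []                     # min-heap over the kept hash values
    for h in sample_hashes:
        if h in counts:
            counts[h] += 1        # already kept: just bump its counter
        elif len(heap) < C:
            heapq.heappush(heap, h)
            counts[h] = 1
        elif h > heap[0]:         # beats the smallest kept hash: swap them
            evicted = heapq.heapreplace(heap, h)
            del counts[evicted]
            counts[h] = 1
        # else: below every kept hash, so ignore it
    delta = 1.0 - heap[0] if heap else 0.0   # covered fraction of the hash domain
    return counts, delta

The DFH of counts is then fed to Range Unseen to bound the number of distinct hashes in the δ slice of the hash domain, and the bounds are divided by δ·N as in step 3.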

The problem here is that the Unseen algorithm runs on a substantially smaller fraction of the data than originally. Recall that it was shown in [18] that accuracy is achieved at a sample fraction of O(1 / log N), and therefore we expect the accuracy to be better when N is larger. Indeed, when limiting the input of Unseen to such a small domain (in some of our tests the domain is reduced by a factor of more than 20,000), the tightness of the estimation suffers. Figure 6 shows an example of the estimation achieved with this method.

Figure 6: The graph depicts the average upper and lower bounds that are given by the algorithm with the streaming model as a function of the sample percentage. Each point is the average over 100 different samples, and the error bars depict the maximum and minimum values over the 100 tests.

In Figure 7 we compare the tightness of the estimation achieved by the two low-memory approaches. Both methods give looser results than the full-fledged method, but the base sampling technique is significantly tighter.

Figure 7: An average range size comparison of the two low-memory methods (also compared to the full memory algorithm). We see that the base method achieves tighter estimation ranges, especially for the higher and more meaningful percentages.

On the flip side, the streaming approach is much simpler to use in parallel environments, where each node can run its sample independently, and at the end all results are merged and a single Unseen execution is run. Another benefit is that one can run an estimation on a certain set of volumes and store the low-memory histogram. Then, at a later stage, new volumes can be scanned and merged with the existing results to get an updated estimation. Although the streaming approach requires a larger sample in order to reach the same level of accuracy, there are scenarios where the base sample method cannot be used and this method can serve as a good fallback option.

3.3 Estimating Combined Compression and Deduplication

Deduplication, more often than not, is used in conjunction with compression. The typical usage is to first apply deduplication and then store the actual chunks in compressed form. Thus the challenge of sizing a system must take compression into account as well as deduplication. The obvious solution to estimating the combination of the two techniques is to estimate each one separately and then look at their multiplied effect. While this practice has its merits (e.g. see [6]), it is often imprecise. The reason is that in some workloads there is a correlation between the duplication level of a chunk and its average compression ratio (e.g. see Figure 8).

We next describe a method for integrating compression into the Unseen algorithm that results in accurate estimations of this combination.

The basic principle is to replace the DFH by a compression weighted DFH. Rather than having x_i hold the number of chunks that appeared i times, we define it as the size (in chunks) that it takes to store the chunks with reference count i. In other words, each count in the regular DFH is multiplied by the average compression ratio of the chunks with that specific duplication count.

The problem is that this is no longer a legal DFH, and in particular it no longer holds that ∑_i x_i · i = N. Instead, it holds that

∑_i x_i · i = CR · N,

where CR is the average compression ratio over the entire dataset (plain compression without deduplication). Luckily, the average compression ratio can be estimated extremely well with a small random sample of the data.

The high level algorithm is then as follows:

1. Compute a DFH {y_1, y_2, ...} on the observed sample, but also compute {CR_1, CR_2, ...}, where CR_i is the average compression ratio of all chunks that had reference count i. Denote by z = {y_1 · CR_1, y_2 · CR_2, ...} the compression weighted DFH of the sample.


2. Compute CR, the average compression ratio on the dataset. This can be done using a very small random sample (which can be part of the already sampled data).

3. Run the Unseen method where the optimization is over Δ_p(x, z) (rather than Δ_p(x, y)) and under the constraint that ∑_i x_i · i = CR · N.
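As an illustration of step 1, the following sketch builds the compression-weighted DFH z from a per-chunk summary of the sample; the input layout (each sampled hash mapped to its count and average compression ratio) and the names are illustrative choices, not a prescribed format.

from collections import defaultdict

def compression_weighted_dfh(sample_summary):
    # sample_summary: dict mapping each distinct sampled hash to
    # (count in the sample, average compression ratio of that chunk).
    y = defaultdict(int)            # plain sample DFH
    cr_sum = defaultdict(float)     # sum of compression ratios per frequency
    for count, ratio in sample_summary.values():
        y[count] += 1
        cr_sum[count] += ratio
    max_i = max(y, default=0)
    # z_i = y_i * CR_i, where CR_i averages the ratios of chunks with count i.
    return [y[i] * (cr_sum[i] / y[i]) if y[i] else 0.0
            for i in range(1, max_i + 1)]

Unseen is then run against z under the constraint ∑_i x_i · i = CR · N, with CR estimated from a small random sample as in step 2.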

Figure 8 shows the success of this method on the same dataset and contrasts it with the naive approach of looking at deduplication and compression independently.

Figure 8: The graph depicts the average upper and lower bounds that are given by the algorithm with compression. Each point is based on 100 different samples. The naive ratio is derived by multiplying independent compression and deduplication ratios.

Note that estimating CR can also be done efficiently and effectively under our two models of low-memory execution. In the base sample method, taking the average compression ratio over the base chunks only is sufficient, so compression needs to be computed only in the initial small base sample phase. In the streaming case, things are a bit more complicated, but in a nutshell, the average CR for the chunks in the fraction of hashes at hand is estimated as the weighted average of the compression ratios of the chunks in the top C hashes, where the weight is their reference count (a chunk that appeared twice is given double the weight).

3.4 How to Sample in the Real World

Thus far we have avoided the question of how to actually sample chunks from the dataset, yet our sampling has a long list of requirements:

• Sample uniformly at random over the entire (possibly distributed) dataset.

• Sample without repetitions.

• Use low memory for the actual sampling.

• We want the option to do a gradual sample, e.g., first sample a small percent, evaluate it, and then add more samples if needed (without repeating old samples).

• Above all, we need this to be substantially faster than running a full scan (otherwise there is no gain). Recall that the scan time dominates the running time (the time to solve the linear programs is negligible).

The last requirement is the trickiest of them all, especially if the storage medium is based on rotating disks (HDDs). If the data lies on a storage system that supports fast short random reads (flash or solid state drive based systems), then sampling is much faster than a full sequential scan. The problem is that with HDDs there is a massive drop-off from the performance of sequential reads to that of small random reads.

There are some ways to mitigate this drop-off: sorting the reads in ascending order is helpful, but mainly reading at chunks larger than 4KB, where 1MB seems to be the tipping point. In Figure 9 we see measurements of the time it takes to sample a fraction of the data vs. the time a full scan would take. While it is very hard to gain anything by sampling at 4KB chunks, there are significant time savings when sampling at 1MB; for instance, sampling a 15% fraction of the data is 3X faster than a full scan (assuming sorted 1MB reads).

Figure 9: The graph depicts the time it takes to sample different percentages at different chunk sizes with respect to a full scan. We see that sampling at 4K is extremely slow, and while 64K is better, one has to climb up to 1MB to get decent savings. Going higher than 1MB shows little improvement. The tests were run on an Intel Xeon E5440 @ 2.83GHz CPU with a 300GB Hitachi GST Ultrastar 15K rpm HDD.

Name | Description | Size | Deduplication Ratio | Deduplication + Compression
VM Repository | A large repository of VMs used by a development unit. This is a VMWare environment with a mixture of Linux and Windows VMs. | 8TB | 0.3788 | 0.2134
Linux Hypervisor | Backend store for Linux VMs of a KVM Hypervisor. The VMs belong to a research unit. | 370GB | 0.4499 | 0.08292
Windows Hypervisor | Backend store for Windows VMs of a KVM Hypervisor. The VMs belong to a research unit. | 750GB | 0.7761 | 0.4167
VDI | A VDI benchmark environment containing 50 Windows VMs generated by VMWare's ViewPlanner tool. | 770GB | 0.029 | 0.0087
DB | An Oracle Data Base containing data from a TPCC benchmark. | 1.2TB | 0.37884 | 0.21341
Cloud Store | A research unit's private cloud storage. Includes both user data and VM images. | 3.3TB | 0.26171 | –

Table 1: The datasets used in our evaluation, along with their data reduction ratios.

Accuracy with 1MB reads? The main question is then: does our methodology work with 1MB reads? Clearly, estimating deduplication at a 1MB granularity is not a viable solution, since deduplication ratios can change drastically with the chunk size. Instead, we read super-chunks of 1MB, break them into 4KB chunks and use these correlated chunks for our sample. The main concern here is that the fact that the samples are not independent will create high correlations between the reference counts of the various chunks in the sample. For example, in many deduplication friendly environments, the repetitions are quite long, and a repetition of a single chunk often entails a repetition of the entire super-chunk (and vice versa, a non-repetition of a single chunk could mean a high probability of no repetitions in the super-chunk).

The good news is that, due to linearity of expectation, the expected DFH should not change when sampling at a large super-chunk granularity. On the other hand, the variance can grow significantly. As before, we control this by increasing the slackness parameter α to allow a larger scope of plausible solutions to be examined. In our tests we raise the slackness from α = 0.5 to α = 2.0, and when combined with the base sample it is raised to α = 3.5. Our tests show that this is sufficient to handle the increased variance, even in workloads where we know that there are extremely high correlations in the repetitions. Figure 10 shows the algorithm results when reading at 1MB super-chunks with the increased slackness.

Figure 10: The graph depicts the average upper and lower bounds that are given by the algorithm when reading at 1MB super-chunks. Each point is based on 100 different samples.

How to sample? We next present our sampling strategy, which fulfills all of the requirements listed above with the additional requirement of generating sorted reads of a configurable chunk size. The process iterates over all chunk IDs in the system and computes a fast hash function on the chunk ID (the ID has to be a unique identifier, e.g. volume name and chunk offset). The hash can be a very efficient function like CRC (we use a simple linear function modulo a large number). This hash value is then used to determine whether the chunk is in the current sample or not. The simple algorithm is outlined in Algorithm 3.

Algorithm 3: Simple Sample
Input: Fraction bounds p0, p1; total chunks M
Output: Sample S(p1−p0)

for j ∈ [M] do
    q ← FastHash(j)    /* FastHash outputs a number in [0, 1) */
    if p0 ≤ q < p1 then
        Add the j-th chunk to the sample

This simple technique fulfills all of the requirements listed above. It can be easily parallelized, and in fact there is no limitation on the enumeration order. However, within each disk, if one iterates in ascending order then the reads will come out sorted, as required. It is nearly stateless; one only needs to remember the current index j and the input parameters. In order to run a gradual sample, for example, a first sample of 1% and then another 5%, run it first with p0 = 0, p1 = 0.01 and then again with p0 = 0.01, p1 = 0.06. The fast hash is only required to be sufficiently random (any pairwise independent hash would suffice [3]) and there is no need for a heavy full-fledged cryptographic hash like SHA1. As a result, the main loop can be extremely fast, and our tests show that the overhead of the iteration and hash is negligible (less than 0.5% of the time that it takes to sample a 1% fraction of the dataset). Note that the result is a sample of fraction approximately (p1 − p0), which is sufficient for all practical purposes.
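For illustration, a Python sketch of Algorithm 3; the linear-hash constants are placeholders (any pairwise independent hash over the chunk-ID space would do [3]) and the names are illustrative.

A, B = 6364136223846793005, 1442695040888963407   # arbitrary odd constants
P = (1 << 61) - 1                                  # a large (Mersenne) prime

def fast_hash(chunk_id):
    # Maps a chunk index to a pseudo-random value in [0, 1).
    return ((A * chunk_id + B) % P) / P

def sample_indices(p0, p1, total_chunks):
    # Yields, in ascending order, the chunk indices whose hash falls in [p0, p1).
    for j in range(total_chunks):
        if p0 <= fast_hash(j) < p1:
            yield j

For a gradual sample, read sample_indices(0, 0.01, M) first; to grow the sample to roughly 6%, read sample_indices(0.01, 0.06, M); the two sets of indices never overlap.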


4 Evaluation

4.1 Implementation and Test Data

We implemented the core techniques in Matlab and evaluated the implementation on data from a variety of real life datasets that are customary in enterprise storage systems. The different datasets are listed in Table 1 along with their data reduction ratios. It was paramount to test datasets with a variety of deduplication and compression ratios in order to validate that our techniques are accurate for all ranges. Note that in our datasets we remove all zero chunks, since identifying the fraction of zero chunks is an easy task and the main challenge is estimating the deduplication ratio on the rest of the data.

For each of the datasets we generated between 30 and 100 independent random samples for each sample percentage and for each of the relevant sampling strategies being deployed (e.g., with or without repetitions, or at 1MB super-chunks). These samples serve as the basis for verifying our methods and fine tuning the slackness parameter.

4.2 Results

Range sizes as a function of the dataset. The most glaring phenomenon that we observe while testing the technique over multiple datasets is the big discrepancy in the size of the estimation ranges for different datasets. The most defining factor is the data reduction ratio at hand. It turns out that deduplication ratios that are close to 1/2 are in general harder to approximate accurately and require a larger sample fraction in order to get a tight estimation. Highly dedupable data and data with no duplication, on the other hand, tend to converge very quickly, and using our method one can get a very good read within the first 1-2% of sampled data. Figure 12 shows this phenomenon clearly.

Note that the addition of compression ratios into the mix has a different effect: it basically reduces the range by roughly a constant factor that is tied to the compression benefits. For example, for the Windows Hypervisor data the combined deduplication and compression ratio is 0.41, an area where the estimation is hardest. But the convergence seen by the algorithm is significantly better – it is similar to what is seen for deduplication alone (a 0.77 ratio) with a roughly constant reduction factor. See the example in Figure 13. Judging by the other datasets, we observe that this reduction factor is more significant when the compression ratio is better.

Figure 12: The graph depicts the effect that the deduplication ratio has on the tightness of our estimation method (based on the 6 different datasets and their various deduplication ratios). Each line stands for a specific sample percent and charts the average range size as a function of the deduplication ratio. We see that while the bottom lines of 10% and more are good across all deduplication ratios, the top lines of 1-5% are more like a "Boa digesting an Elephant" – they behave very well at the edges but balloon in the middle.

Figure 13: A comparison of the estimation range sizes achieved on the Windows Hypervisor data. We see a constant and significantly tighter estimation when compression is involved. This phenomenon holds for all datasets.

The accumulated effect on estimation tightness. In this work we present two main techniques that are critical enablers for the technology but reduce the tightness of the initial estimation. The accumulated effect on the tightness of the estimation when using the combination of these techniques is shown in Figure 14. It is interesting to note that combining the two techniques has a smaller negative effect on the tightness than the sum of their separate effects.

Figure 14: A comparison of the estimation range sizes achieved with various work methods: the basic method, the method with sampling at 1MB, using the base sample low-memory technique, and the combination of both the base sample and 1MB sampling. There is a steady decline in tightness when moving from one to the next. This is shown on two different datasets with very different deduplication ratios.

Putting it all together. Throughout the paper we evaluated the effect of each of our innovations separately, while sometimes also examining joint effects. In this section we aim to put all of our techniques together and show a functional result. The final construction runs the Range Unseen algorithm with the Base Sample low-memory technique while sampling at 1MB super-chunks. We test both estimating deduplication only and estimation with compression. Each test consists of a single gradual execution starting from 1% all the way through 20% at small intervals of 1%. The results are depicted in Figure 11.

Figure 11: Execution of the full real life setting algorithm on the various datasets. This is a gradual run of the low-memory, 1MB read algorithm, with and without compression. Note that some of the tests reach very tight results with a 3% sample and can thus achieve a much improved running time.

There are two main conclusions from the results in Figure 11. One is that the method actually works and produces accurate and useful results. The second is the great variance between the different runs and the fact that some runs can end well short of a 5% sample. As mentioned in the previous paragraph, this is mostly related to the deduplication and compression ratios involved. But the bottom line is that we can calibrate this into the expected time it takes to run the estimation. In the worst case, one would have to read at least 15% of the data, which leads to a time improvement of approximately 3X in HDD systems (see Section 3.4). On the other hand, we have tests that can end with a sample of 2-3% and yield a time saving of 15-20X over a full scan. The time improvement can be even more significant in cases where the data resides on SSDs or if the hash computation is a bottleneck in the system.

5 Concluding remarks

Our work introduced new advanced algorithms into the world of deduplication estimation. The main challenges were to make these techniques actually applicable and worthwhile in a real world scenario. We believe we have succeeded in proving the value of this approach, which can be used to replace the full scans used today.


Acknowledgements. We thank Evgeny Lipovetsky for his help and thank Oded Margalit and David Chambliss for their insights. Finally, we are grateful to our many colleagues at IBM who made the collection of the various datasets possible. This work was partially funded by the European Community's Seventh Framework Programme (FP7/2007-2013) project ForgetIT under grant No. 600826.

References

[1] XtremIO case studies. http://xtremio.com/case-studies.

[2] BAR-YOSSEF, Z., JAYRAM, T. S., KUMAR, R., SIVAKUMAR, D., AND TREVISAN, L. Counting distinct elements in a data stream. In Randomization and Approximation Techniques, 6th International Workshop, RANDOM 2002 (2002), pp. 1–10.

[3] CARTER, L., AND WEGMAN, M. N. Universal classes of hash functions. J. Comput. Syst. Sci. 18, 2 (1979), 143–154.

[4] CHARIKAR, M., CHAUDHURI, S., MOTWANI, R., AND NARASAYYA, V. R. Towards estimation error guarantees for distinct values. In Symposium on Principles of Database Systems, PODS 2000 (2000), pp. 268–279.

[5] CONSTANTINESCU, C., GLIDER, J. S., AND CHAMBLISS, D. D. Mixing deduplication and compression on active data sets. In Data Compression Conference, DCC (2011), pp. 393–402.

[6] CONSTANTINESCU, C., AND LU, M. Quick Estimation of Data Compression and De-duplication for Large Storage Systems. In Proceedings of the 2011 First International Conference on Data Compression, Communications and Processing (2011), IEEE, pp. 98–102.

[7] FLAJOLET, P., AND MARTIN, G. N. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci. 31, 2 (1985), 182–209.

[8] GIBBONS, P. B. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In Very Large Data Bases, VLDB (2001), pp. 541–550.

[9] HAAS, P. J., NAUGHTON, J. F., SESHADRI, S., AND STOKES, L. Sampling-based estimation of the number of distinct values of an attribute. In Very Large Data Bases, VLDB '95 (1995), pp. 311–322.

[10] HAAS, P. J., AND STOKES, L. Estimating the number of classes in a finite population. IBM Research Report RJ 10025, IBM Almaden Research 93 (1998), 1475–1487.

[11] HARNIK, D., KAT, R., MARGALIT, O., SOTNIKOV, D., AND TRAEGER, A. To Zip or Not to Zip: Effective Resource Usage for Real-Time Compression. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST 2013) (2013), USENIX Association, pp. 229–241.

[12] HARNIK, D., MARGALIT, O., NAOR, D., SOTNIKOV, D., AND VERNIK, G. Estimation of deduplication ratios in large data sets. In IEEE 28th Symposium on Mass Storage Systems and Technologies, MSST 2012 (2012), pp. 1–11.

[13] JIN, K., AND MILLER, E. L. The effectiveness of deduplication on virtual machine disk images. In The Israeli Experimental Systems Conference, SYSTOR 2009 (2009).

[14] KANE, D. M., NELSON, J., AND WOODRUFF, D. P. An optimal algorithm for the distinct elements problem. In Symposium on Principles of Database Systems, PODS 2010 (2010), pp. 41–52.

[15] MEYER, D. T., AND BOLOSKY, W. J. A study of practical deduplication. In Proceedings of the 9th USENIX Conference on File and Storage Technologies (FAST '11) (2011), pp. 1–13.

[16] RASKHODNIKOVA, S., RON, D., SHPILKA, A., AND SMITH, A. Strong lower bounds for approximating distribution support size and the distinct elements problem. SIAM J. Comput. 39, 3 (2009), 813–842.

[17] VALIANT, G., AND VALIANT, P. Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In 43rd ACM Symposium on Theory of Computing, STOC (2011), pp. 685–694.

[18] VALIANT, P., AND VALIANT, G. Estimating the unseen: Improved estimators for entropy and other properties. In Advances in Neural Information Processing Systems 26 (2013), pp. 2157–2165.

[19] XIE, F., CONDICT, M., AND SHETE, S. Estimating duplication by content-based sampling. In Proceedings of the 2013 USENIX Conference on Annual Technical Conference (2013), USENIX ATC'13, pp. 181–186.