Data Mining Technologies for Digital Libraries & Web Information Systems
Ramakrishnan Srikant
Talk Outline
Taxonomy Integration (WWW 2001, with R. Agrawal)
Searching with Numbers
Privacy-Preserving Data Mining
Taxonomy Integration
B2B electronics portal: 2000 categories, 200K datasheets
[Figure: Master Catalog: ICs with categories DSP, Mem., Logic holding documents a, b, c, d, e, f. New Catalog: ICs with categories Cat1, Cat2 holding documents x, y, z, w.]
Taxonomy Integration (2)
After integration:
[Figure: ICs with categories DSP, Mem., Logic holding documents a, b, c, d, e, f, x, y, z, w.]
Goal
Use affinity information in new catalog.
– Products in same category are similar.
Accuracy boost depends on match between two categorizations.
Problem Statement
Given
– master categorization M: categories C_1, C_2, …, C_n
  set of documents in each category
– new categorization N: categories S_1, S_2, …, S_n
  set of documents in each category
Find the category in M for each document in N
– Standard Alg: Estimate Pr(C_i | d)
– Enhanced Alg: Estimate Pr(C_i | d, S)
Naive Bayes Classifier
Estimate probability of document d belonging to class C_i:
$\Pr(C_i \mid d) = \frac{\Pr(C_i)\,\Pr(d \mid C_i)}{\Pr(d)}$
Where
$\Pr(C_i) = \frac{\text{Number of documents in } C_i}{\text{Total number of documents}}$
$\Pr(d \mid C_i) = \prod_{t \in d} \Pr(t \mid C_i)$
$\Pr(t \mid C_i) = \frac{\text{\# of occurrences of } t \text{ in } C_i}{\text{Total words in } C_i}$
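The estimates on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind the talk: the function names, the toy training data, and the add-one smoothing (which avoids zero probabilities for unseen words) are assumptions that do not appear on the slide.

```python
import math
from collections import Counter

def train(docs_by_class):
    """docs_by_class maps a class name to a list of documents (lists of words)."""
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    vocab = {t for docs in docs_by_class.values() for d in docs for t in d}
    model = {}
    for c, docs in docs_by_class.items():
        counts = Counter(t for d in docs for t in d)
        model[c] = {
            "prior": len(docs) / total_docs,   # Pr(C_i)
            "counts": counts,                  # occurrences of each t in C_i
            "total": sum(counts.values()),     # total words in C_i
        }
    return model, vocab

def log_posteriors(model, vocab, doc):
    """Unnormalized log Pr(C_i | d) = log Pr(C_i) + sum over t of log Pr(t | C_i)."""
    scores = {}
    for c, m in model.items():
        score = math.log(m["prior"])
        for t in doc:
            # Add-one smoothing: an assumption, the slide uses the raw ratio.
            score += math.log((m["counts"][t] + 1) / (m["total"] + len(vocab)))
        scores[c] = score
    return scores

def classify(model, vocab, doc):
    scores = log_posteriors(model, vocab, doc)
    return max(scores, key=scores.get)

# Hypothetical toy data, loosely modeled on the electronics catalog example.
training = {
    "DSP": [["dsp", "filter", "mhz"], ["dsp", "audio"]],
    "Mem": [["sdram", "mb"], ["flash", "mb", "memory"]],
}
model, vocab = train(training)
```

Working in log space avoids numeric underflow when a document has many words; Pr(d) is dropped because it is the same for every class.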
Enhanced Naïve Bayes
Standard: $\Pr(C_i \mid d) \propto \Pr(C_i)\,\Pr(d \mid C_i)$
Enhanced: $\Pr(C_i \mid d, S) \propto \Pr(C_i \mid S)\,\Pr(d \mid C_i)$
How do we estimate Pr(C_i | S)?
– Apply standard Naïve Bayes to get the number of documents in S that are classified into C_i.
– Incorporate weight w reflecting match between the two taxonomies.
– Only affects classification of borderline documents.
– For w = 0, default to standard classifier.
Enhanced Naïve Bayes (2)
$\Pr(C_i \mid S) \propto |C_i| \times (\text{\# docs in } S \text{ predicted to be in } C_i)^w$
$\Pr(C_i \mid S) = \frac{|C_i| \times (\text{\# docs in } S \text{ predicted to be in } C_i)^w}{\sum_j \left( |C_j| \times (\text{\# docs in } S \text{ predicted to be in } C_j)^w \right)}$
Use tuning set to determine w.
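The weighted prior can be computed directly from the master category sizes and the standard classifier's predictions for the documents in S. The sketch below assumes hypothetical category names and counts; `enhanced_priors` is an illustrative helper, not a name from the talk.

```python
def enhanced_priors(master_sizes, predicted_counts, w):
    """Pr(C_i | S), proportional to |C_i| * (# docs in S predicted in C_i) ** w."""
    raw = {
        c: master_sizes[c] * (predicted_counts.get(c, 0) ** w)
        for c in master_sizes
    }
    total = sum(raw.values())
    return {c: v / total for c, v in raw.items()}

# Hypothetical numbers: two equally sized master categories, and a source
# category S whose three documents were predicted 1 : 2 by the standard classifier.
sizes = {"Computer Peripheral": 100, "Digital Camera": 100}
predicted = {"Computer Peripheral": 1, "Digital Camera": 2}

standard = enhanced_priors(sizes, predicted, w=0)   # falls back to |C_i| only
weighted = enhanced_priors(sizes, predicted, w=1)   # leans toward Digital Camera
```

With w = 0 every count is raised to the power 0, so the prior depends only on |C_i|, which matches the fallback to the standard classifier. Larger w lets the majority prediction within S pull borderline documents along.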
Intuition behind Algorithm
Standard Algorithm
      Computer Peripheral   Digital Camera
P1    20%                   80%
P2    40%                   60%
P3    60%                   40%
Enhanced Algorithm
      Computer Peripheral   Digital Camera
P1    15%                   85%
P2    30%                   70%
P3    45%                   55%
Electronic Parts Dataset
[Figure: "Accuracy Improvement on Pangea Data". x-axis: Weight (1, 2, 5, 10, 25, 50, 100, 200); y-axis: Accuracy (60 to 100); series: Perfect, 90-10, 80-20, GaussianA, GaussianB, Base.]
1150 categories; 37,000 documents
Yahoo & OpenDirectory
5 slices of the hierarchy: Autos, Movies, Outdoors, Photography, Software
– Typical match: 69%, 15%, 3%, 3%, 1%, ….
Merging Yahoo into OpenDirectory
– 30% fewer errors (14.1% absolute difference in accuracy)
Merging OpenDirectory into Yahoo
– 26% fewer errors (14.3% absolute difference)
Summary
New algorithm for taxonomy integration.
– Exploits affinity information in the new (source) taxonomy categorizations.
– Can do substantially better, and never does significantly worse than standard Naïve Bayes.
Open Problems: SVM, Decision Tree, ...
Talk Outline
Taxonomy Integration
Searching with Numbers (WWW 2002, with R. Agrawal)
Privacy-Preserving Data Mining
Motivation
A large fraction of the useful web consists of specification documents.
– <attribute name, value> pairs embedded in text.
Examples:
– Data sheets for electronic parts.
– Classified ads.
– Product catalogs.
Search Engines treat Numbers as Strings
Search for 6798.32 (lunar nutation cycle)
– Returns 2 pages on Google
– However, search for 6798.320 yielded no page on Google (and all other search engines)
Current search technology is inadequate for retrieving specification documents.
Data Extraction is hard
Synonyms for attribute names and units.
– "lb" and "pounds", but no "lbs" or "pound".
Attribute names are often missing.
– No "Speed", just "MHz Pentium III"
– No "Memory", just "MB SDRAM"
• 850 MHz Intel Pentium III
• 192 MB RAM
• 15 GB Hard Disk
• DVD Recorder: Included;
• Windows Me
• 14.1 inch display
• 8.0 pounds
Searching with Numbers
[Figure: documents ("IBM ThinkPad: 750 MHz Pentium 3, 196 MB DRAM, …"; "Dell Computer: 700 MHz Celeron, 256 MB SDRAM, …") are reduced to a database of numbers: IBM ThinkPad (750 MHz, 196 MB), Dell (700 MHz, 256 MB), … Example queries: "800 200 3 lb" and "800 200".]
Reflectivity
If we get a close match on numbers, how likely is it that we have correctly matched attribute names?
– Likelihood tracks the non-reflectivity (of data)
Non-overlapping attributes ⇒ Non-reflective.
– Memory: 64 - 512 Mb, Disk: 10 - 40 Gb
Correlations or Clustering ⇒ Low reflectivity.
– Memory: 64 - 512 Mb, Disk: 10 - 100 Gb
Reflectivity: Examples
[Figure: three scatter plots, each on axes from 0 to 50, titled "Non-Reflective", "Low Reflectivity", and "High Reflectivity".]
Reflectivity: Definition
Let
– D: dataset
– n_i: co-ordinates of point x_i
– reflections(x_i): permutations of n_i
– η(n_i): # of points within distance r of n_i
– ρ(n_i): # of reflections within distance r of n_i
$\text{Non-Reflectivity} = \frac{1}{|D|} \sum_{x_i \in D} \frac{\eta(n_i)}{\rho(n_i)}$
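A brute-force sketch of this definition follows. It assumes that the reflection count for n_i is the number of points in the dataset having at least one coordinate permutation (the identity included) within distance r of n_i; the slide does not spell this out, and the function name and sample data are illustrative only.

```python
import math
from itertools import permutations

def non_reflectivity(points, r):
    """Average over all points of (# nearby points) / (# points with a nearby reflection)."""
    total = 0.0
    for n_i in points:
        # Points within distance r of n_i (n_i itself is always counted).
        near = sum(1 for x in points if math.dist(x, n_i) <= r)
        # Points with at least one coordinate permutation within distance r
        # of n_i; the identity permutation counts, so reflected >= near >= 1.
        reflected = sum(
            1 for x in points
            if any(math.dist(perm, n_i) <= r for perm in permutations(x))
        )
        total += near / reflected
    return total / len(points)

# Non-overlapping attribute ranges (memory vs. disk): swapping coordinates
# never lands near another point.
separated = [(64, 10), (128, 20), (512, 40)]
# Overlapping ranges: each point is the mirror image of the other.
mirrored = [(10, 20), (20, 10)]
```

The loop is quadratic in the number of points and factorial in the number of attributes, which is acceptable only for small illustrations like this one.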
Algorithm
How to compute match score (rank) of a document for a given query?
How to limit the number of documents for which the match score is computed?
Match Score of a Document
Select k numbers from D yielding minimum distance between Q and D.
Relative distance for each term:
$f(q_i, n_j) = \frac{|q_i - n_j|}{|q_i + \epsilon|}$
L_p norm (Euclidean distance when p = 2) to combine term distances:
$F(Q, D) = \left( \sum_{i=1}^{k} f(q_i, n_{j_i})^p \right)^{1/p}$
Bipartite Graph Matching
Map problem to Bipartite Graph Matching
– k source nodes: corr. to query numbers
– m target nodes: corr. to document numbers
– An edge from each source to k nearest targets.
Assign weight f(q_i, n_j)^p to the edge (q_i, n_j).
[Figure: Query: 20, 60. Doc: 10, 25, 75. Edge weights: (20, 10) = .5, (20, 25) = .25, (60, 25) = .58, (60, 75) = .25.]
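For a query and document this small, the minimum-weight matching can be found by simply trying every assignment. The sketch below does that instead of running a real bipartite matching algorithm, so it only illustrates the score being minimized. The value of `EPS` and the function names are assumptions.

```python
from itertools import permutations

EPS = 1e-9  # hypothetical small constant; keeps the ratio defined when q = 0

def f(q, n):
    """Relative distance |q - n| / |q + eps| between a query and a document number."""
    return abs(q - n) / abs(q + EPS)

def match_score(query, doc_numbers, p=2):
    """Smallest L_p combination over assignments of distinct document numbers."""
    k = len(query)
    if len(doc_numbers) < k:
        return float("inf")
    best = float("inf")
    # Brute force over ordered selections of k numbers; fine for tiny inputs.
    for chosen in permutations(doc_numbers, k):
        cost = sum(f(q, n) ** p for q, n in zip(query, chosen))
        best = min(best, cost)
    return best ** (1 / p)

# Query and document from the figure: the best assignment pairs 20 with 25
# and 60 with 75, each at relative distance .25.
score_p1 = match_score([20, 60], [10, 25, 75], p=1)
score_p2 = match_score([20, 60], [10, 25, 75], p=2)
```

A production version would replace the permutation loop with a min-cost assignment solver, since the brute force grows factorially with the number of query terms.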
Limiting the Set of Documents
Similar to the score aggregation problem [Fagin, PODS 96]
Proposed algorithm is an adaptation of the TA algorithm in [Fagin-Lotem-Naor, PODS 01]
Let n_i := number last looked at for query term q_i
Let $\tau := \left( \sum_{i=1}^{k} f(q_i, n_i)^p \right)^{1/p}$
Halt when t documents found whose distance <= τ
τ is lower bound on distance of unseen documents
Limiting the set of documents
k conceptual sorted lists, one for each query term
Do round robin access to the lists. For each document found, compute its distance F(Q, D)
[Figure: sorted lists for query terms 20 and 60. Entries (number/distance) for 20: 25/.25, 10/.5, 35/.75; for 60: 66/.1, 75/.25, 25/.58. Each entry points to the documents among D1 to D9 that contain that number.]
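The halting rule can be sketched as follows. This is a simplified illustration of the threshold idea, not the algorithm from the paper: the sorted lists are built eagerly in memory, the match score is computed by brute force, and all names and the sample documents are hypothetical.

```python
from itertools import permutations

def f(q, n, eps=1e-9):
    """Relative distance between query number q and document number n."""
    return abs(q - n) / abs(q + eps)

def match_score(query, numbers, p):
    """Brute-force F(Q, D): best assignment of distinct numbers to query terms."""
    if len(numbers) < len(query):
        return float("inf")
    return min(
        sum(f(q, n) ** p for q, n in zip(query, chosen))
        for chosen in permutations(numbers, len(query))
    ) ** (1 / p)

def top_t(query, docs, t, p=2):
    """docs maps a document id to its numbers; returns the t best (score, id) pairs."""
    # One sorted list per query term: (distance, document id), nearest first.
    lists = [
        sorted((f(q, n), d) for d, nums in docs.items() for n in nums)
        for q in query
    ]
    last = [0.0] * len(query)       # distance of the number last looked at per term
    seen, best = set(), []
    for depth in range(len(lists[0])):
        for i, lst in enumerate(lists):            # round-robin access
            last[i], d = lst[depth]
            if d not in seen:
                seen.add(d)
                best.append((match_score(query, docs[d], p), d))
        best = sorted(best)[:t]
        # Threshold: no unseen document can score below this value.
        tau = sum(x ** p for x in last) ** (1 / p)
        if len(best) == t and best[-1][0] <= tau:
            break
    return best

docs = {"D1": [10, 25, 75], "D2": [20, 60], "D3": [500, 900], "D4": [22, 66]}
result = top_t([20, 60], docs, t=2)
```

An unseen document has every one of its numbers further down each list than the current position, so each of its term distances is at least the corresponding entry of `last`, which is why the threshold is a valid lower bound.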
Empirical Results
[Figure: Precision (0 to 100) vs. Query Size (1 to 5); series: DRAM, LCD, Proc, Trans, Auto, Credit, Glass, Housing, Wine.]
Empirical Results (2)
Screen Shot
Incorporating Hints
Use simple data extraction techniques to get hints.
Names/Units in query matched against Hints.
• 256 MB SDRAM memory
  Unit Hint: MB
  Attribute Hint: SDRAM, memory
Summary
Allows querying using only numbers or numbers + hints.
Data can come from raw text (e.g. product descriptions) or databases.
End run around data extraction.
– Use simple extractor to generate hints.
Open Problems: integration with keyword search.
Talk Outline
Taxonomy Integration
Searching with Numbers
Privacy-Preserving Data Mining
– Motivation
– Classification
– Associations
Growing Privacy Concerns
Popular Press:
– Economist: The End of Privacy (May 99)
– Time: The Death of Privacy (Aug 97)
Govt. legislation:
– European directive on privacy protection (Oct 98)
– Canadian Personal Information Protection Act (Jan 2001)
Special issue on internet privacy, CACM, Feb 99
S. Garfinkel, "Database Nation: The Death of Privacy in 21st Century", O'Reilly, Jan 2000
Privacy Concerns (2)
Surveys of web users
– 17% privacy fundamentalists, 56% pragmatic majority, 27% marginally concerned (Understanding net users' attitude about online privacy, April 99)
– 82% said having privacy policy would matter (Freebies & Privacy: What net users think, July 99)
Technical Question
Fear:
– "Join" (record overlay) was the original sin.
– Data mining: new, powerful adversary?
The primary task in data mining: development of models about aggregated data.
Can we develop accurate models without access to precise information in individual data records?
Talk Outline
Taxonomy Integration
Searching with Numbers
Privacy-Preserving Data Mining
– Motivation
– Private Information Retrieval
– Classification (SIGMOD 2000, with R. Agrawal)
– Associations
Web Demographics
Volvo S40 website targets people in their 20s
– Are visitors in their 20s or 40s?
– Which demographic groups like/dislike the website?
Solution Overview
[Figure: original records (50 | 40K | ...), (30 | 70K | ...) each pass through a Randomizer, giving randomized records (65 | 20K | ...), (25 | 60K | ...). From these, "Reconstruct distribution of Age" and "Reconstruct distribution of Salary" feed the Data Mining Algorithms, which produce the Model.]
Reconstruction Problem
Original values x_1, x_2, ..., x_n
– from probability distribution X (unknown)
To hide these values, we use y_1, y_2, ..., y_n
– from probability distribution Y
Given
– x_1+y_1, x_2+y_2, ..., x_n+y_n
– the probability distribution of Y
Estimate the probability distribution of X.
Intuition (Reconstruct single point)
Use Bayes' rule for density functions
[Figure: Age axis from 10 to 90 with the randomized value V marked; curves labeled "Original distribution for Age" and "Probabilistic estimate of original value of V".]
Reconstructing the Distribution
Combine estimates of where point came from for all the points:
– Gives estimate of original distribution.
[Figure: Age axis from 10 to 90.]
Reconstruction: Bootstrapping
f_X^0 := Uniform distribution
j := 0 // Iteration number
repeat
– $f_X^{j+1}(a) := \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y((x_i + y_i) - a)\, f_X^j(a)}{\int_{-\infty}^{\infty} f_Y((x_i + y_i) - z)\, f_X^j(z)\, dz}$ (Bayes' rule)
– j := j+1
until (stopping criterion met)
Converges to maximum likelihood estimate.
– D. Agrawal & C.C. Aggarwal, PODS 2001.
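A discretized version of this iteration is sketched below. Several choices are assumptions made for illustration: the density is represented on a fixed grid of candidate values, the integral becomes a sum over that grid, a fixed iteration count stands in for the stopping criterion, and the sample data (Gaussian ages, uniform noise) is synthetic.

```python
import random

def reconstruct(w, noise_pdf, grid, iterations=50):
    """Iterate the Bayes update for f_X on a grid, given w_i = x_i + y_i."""
    fx = [1.0 / len(grid)] * len(grid)      # f_X^0: uniform over the grid
    for _ in range(iterations):             # fixed count replaces the stopping test
        new = [0.0] * len(grid)
        for wi in w:
            # Posterior over grid values a for this one randomized observation.
            post = [noise_pdf(wi - a) * fa for a, fa in zip(grid, fx)]
            z = sum(post)
            if z == 0.0:
                continue                    # observation unreachable from the grid
            for k, pk in enumerate(post):
                new[k] += pk / z
        total = sum(new)                    # equals n unless observations were skipped
        fx = [v / total for v in new]
    return fx

def uniform_pdf(y):
    """Density of the noise distribution Y = U[-30, +30]."""
    return 1 / 60 if -30 <= y <= 30 else 0.0

random.seed(0)
ages = [random.gauss(40, 5) for _ in range(500)]        # hidden original values
w = [x + random.uniform(-30, 30) for x in ages]         # what the miner sees
grid = list(range(0, 101, 5))
fx = reconstruct(w, uniform_pdf, grid)
mean_age = sum(a * p for a, p in zip(grid, fx))
```

Only the randomized values `w` and the noise density enter `reconstruct`; the original `ages` are used here solely to generate test data.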
Seems to work well!
[Figure: Number of People (0 to 1200) vs. Age (20 to 60); series: Original, Randomized, Reconstructed.]
Recap: Why is privacy preserved?
Cannot reconstruct individual values accurately.
Can only reconstruct distributions.
Talk Outline
Taxonomy Integration
Searching with Numbers
Privacy-Preserving Data Mining
– Motivation
– Private Information Retrieval
– Classification
– Associations (KDD 2002, with A. Evfimievski, R. Agrawal & J. Gehrke)
Association Rules
Given:
– a set of transactions
– each transaction is a set of items
Association Rule: 30% of transactions that contain Book1 and Book5 also contain Book20; 5% of transactions contain these items.
– 30% : confidence of the rule.
– 5% : support of the rule.
Find all association rules that satisfy user-specified minimum support and minimum confidence constraints.
Can be used to generate recommendations.
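Support and confidence can be computed directly from their definitions. The sketch below uses a hypothetical four-transaction dataset; the function names are illustrative.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    base = support(transactions, antecedent)
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base if base else 0.0

# Hypothetical transactions for the rule {Book1, Book5} => {Book20}.
transactions = [
    {"Book1", "Book5", "Book20"},
    {"Book1", "Book5"},
    {"Book1"},
    {"Book20"},
]
rule_support = support(transactions, {"Book1", "Book5", "Book20"})
rule_confidence = confidence(transactions, {"Book1", "Book5"}, {"Book20"})
```

Here the rule has support 0.25 (one of four transactions holds all three books) and confidence 0.5 (one of the two transactions with Book1 and Book5 also holds Book20).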
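Recommendations Overview
[Figure: Alice and Bob send randomized transactions to a Recommendation Service (the sets "Book 1, Book 11, Book 21" and "Book 5, Book 25" appear alongside "Book 1, Book 7, Book 21" and "Book 3, Book 25"). The service performs Support Recovery, finds Associations, and returns Recommendations.]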
Private Information Retrieval
Retrieve 1 of n documents from a digital library without the library knowing which document was retrieved.
Trivial solution: Download entire library.
Can you do better?
– Yes, with multiple servers.
– Yes, with single server & computational privacy.
Problem introduced in [Chor et al, FOCS 95]
Uniform Randomization
Given a transaction,
– keep item with 20% probability,
– replace with a new random item with 80% probability.
Appears to give around 80% privacy…
– 80% chance that an item in the randomized transaction was not in the original transaction.
Privacy Breach Example
10 M transactions of size 3 with 1000 items:
– 100,000 (1%) have {x, y, z}: 0.2^3 = .008, giving 800 transactions (99.99%)
– 9,900,000 (99%) have zero items from {x, y, z}: 6 * (0.8/1000)^3 = 3 * 10^-9, giving .03 transactions (<< 1) (0.01%)
80% privacy "on average," but not for all items!
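The arithmetic in the privacy breach example can be checked with a short calculation. The variable names are illustrative, and the calculation makes the same simplification as the slide: a transaction that truly contains {x, y, z} is counted only when all three items are kept.

```python
from math import factorial

n_items = 1000
keep, replace = 0.2, 0.8
have_all = 100_000       # transactions that contain {x, y, z}
have_none = 9_900_000    # transactions with zero items from {x, y, z}

# All three original items survive randomization.
p_from_true = keep ** 3
# All three items are replaced, and the replacements happen to be
# x, y, z in any of the 3! possible orders.
p_from_false = factorial(3) * (replace / n_items) ** 3

true_hits = have_all * p_from_true       # about 800 transactions
false_hits = have_none * p_from_false    # about 0.03 transactions

# Chance that a randomized transaction showing {x, y, z} really had it.
breach = true_hits / (true_hits + false_hits)
```

Seeing {x, y, z} in a randomized transaction is therefore almost conclusive evidence that the original contained it, even though each individual item enjoys 80% privacy on average.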
Solution
“Where does a wise man hide a leaf? In the forest. But what does he do if there is no forest?”
“He grows a forest to hide it in.”
G.K. Chesterton
Insert many false items into each transaction.
Hide true itemsets among false ones.
No free lunch: Need more transactions to discover associations.
Related Work
S. Rizvi, J. Haritsa, “Privacy-Preserving Association Rule Mining”, VLDB 2002.
Protecting privacy across databases:
– Y. Lindell and B. Pinkas, “Privacy Preserving Data Mining”, Crypto 2000.
– J. Vaidya and C.W. Clifton, “Privacy Preserving Association Rule Mining in Vertically Partitioned Data”, KDD 2002.
Summary
Have your cake and mine it too!
– Preserve privacy at the individual level, but still build accurate models.
– Can do both classification & association rules.
Open Problems: Clustering, Lower bounds on discoverability versus privacy, Faster algorithms, …
Slides available from ...
www.almaden.ibm.com/cs/people/srikant/talks.html
Backup
Lowest Discoverable Support
LDS is the support s.t., when predicted, it is 4σ away from zero.
Roughly, LDS is proportional to $1/\sqrt{|T|}$
[Figure: "LDS vs. number of transactions". x-axis: Number of transactions, millions (1, 10, 100); y-axis: LDS, % (0 to 1.2); series: 1-itemsets, 2-itemsets, 3-itemsets; |t| = 5, privacy breach level = 50%.]
LDS vs. Breach Level
[Figure: LDS, % (0 to 2.5) vs. Privacy Breach Level, % (30 to 90); |t| = 5, |T| = 5 M.]
Basic 2-server Scheme
Each server returns XOR of green bits.
Client XORs bits returned by the servers.
Communication complexity: O(n)
[Figure: database of 8 bits, numbered 1 to 8, with the selected bits shown in green.]
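The basic 2-server scheme can be sketched as follows, assuming the standard construction in which the client sends a random subset of positions to one server and the same subset with the wanted position toggled to the other. Function names and the sample database are illustrative, and both "servers" are simulated in one process.

```python
import secrets

def server_answer(database, subset):
    """What a server returns: the XOR of the database bits at the requested positions."""
    bit = 0
    for idx in subset:
        bit ^= database[idx]
    return bit

def make_queries(n, i):
    """A uniformly random subset for server 1, and the same subset with i toggled for server 2."""
    s1 = {idx for idx in range(n) if secrets.randbits(1)}
    s2 = s1 ^ {i}       # symmetric difference flips membership of position i
    return s1, s2

def retrieve(database, i):
    """Client side: XOR the two answers; every position except i cancels out."""
    s1, s2 = make_queries(len(database), i)
    return server_answer(database, s1) ^ server_answer(database, s2)

database = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical 8-bit database
recovered = [retrieve(database, i) for i in range(len(database))]
```

Each server on its own sees a uniformly random subset, so it learns nothing about i as long as the two servers do not collude. Describing a subset takes n bits, which is where the O(n) communication comes from.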
Sqrt(n) Algorithm
Each server returns bit-wise XOR of specified blocks.
Client XORs the 2 blocks & selects desired bits.
Each block has sqrt(n) elements => 4*sqrt(n) communication complexity.
Server computation time still O(n)
[Figure: database of 8 elements, numbered 1 to 8.]
Computationally Private IR
Use pseudo-random function + mask to generate sets.
Quadratic residuosity.
Difficulty of deciding whether a small prime divides φ(m)
– m: composite integer of unknown factorization
– φ(m): Euler totient fn, i.e., # of positive integers <= m that are relatively prime to m.
Extensions
Retrieve documents (blocks), not bits.
– If n <= l, comm. complexity 4l.
– If n <= l^2/4, comm. complexity 8l.
Lower communication complexity.
Select documents using keywords.
Protect data privacy.
Preprocessing to reduce computation time.
Computationally-private information retrieval with single server.
Potential Privacy Breaches
Distribution is a spike.
– Example: Everyone is of age 40.
Some randomized values are only possible from a given range.
– Example: Add U[-50,+50] to age and get 125 ⇒ True age is ≥ 75.
– Not an issue with Gaussian.
Potential Privacy Breaches (2)
Most randomized values in a given interval come from a given interval.
– Example: 60% of the people whose randomized value is in [120,130] have their true age in [70,80].
– Implication: Higher levels of randomization will be required.
Correlations can make previous effect worse.
– Example: 80% of the people whose randomized value of age is in [120,130] and whose randomized value of income is [...] have their true age in [70,80].
Work in Statistical Databases
Provide statistical information without compromising sensitive information about individuals (surveys: AW89, Sho82)
Techniques
– Query Restriction
– Data Perturbation
Negative Results: cannot give high quality statistics and simultaneously prevent partial disclosure of individual information [AW89]
Statistical Databases: Techniques
Query Restriction
– restrict the size of query result (e.g. FEL72, DDS79)
– control overlap among successive queries (e.g. DJL79)
– suppress small data cells (e.g. CO82)
Output Perturbation
– sample result of query (e.g. Den80)
– add noise to query result (e.g. Bec80)
Data Perturbation
– replace db with sample (e.g. LST83, LCL85, Rei84)
– swap values between records (e.g. Den82)
– add noise to values (e.g. TYW84, War65)
Statistical Databases: Comparison
We do not assume original data is aggregated into a single database.
Concept of reconstructing original distribution.
– Adding noise to data values problematic without such reconstruction.