
Sovereign Information Sharing and Mining in a Connected World


Page 1: Sovereign Information Sharing and Mining in a Connected World

Sovereign Information Sharing and Mining in a Connected World

R. Agrawal

Intelligent Information Systems Research
IBM Almaden Research Center, San Jose, CA 95120

Joint Work with: D. Asonov, P. Baliga, A. Evfimievski, L. Liang, B. Porst, R. Srikant

Page 2: Sovereign Information Sharing and Mining in a Connected World

Outline

Information sharing today
The new world
Some solution approaches
Observations on privacy-preserving data mining
Musings about the future

R. Agrawal, A. Evfimievski, R. Srikant. Information Sharing Across Private Databases. SIGMOD 03.

R. Agrawal, D. Asonov, R. Srikant. Enabling Sovereign Information Sharing Using Web Services. SIGMOD 04 (Industrial Track).

R. Agrawal, D. Asonov, P. Baliga, L. Liang, B. Porst, R. Srikant. A Reusable Platform for Building Sovereign Information Sharing Applications. DIVO 04.

Page 3: Sovereign Information Sharing and Mining in a Connected World

Information Sharing Today

Assumption: Information in each database can be freely shared.

[Figure: three architectures (Mediator, Federated, Centralized), with query Q and result R arrows.]

Page 4: Sovereign Information Sharing and Mining in a Connected World

Need for a new style of information sharing

Compute queries across databases so that no more information than necessary is revealed (without using a trusted third party).

Need is driven by several trends:
– End-to-end integration of information systems across companies (virtual organizations)
– Simultaneously compete and cooperate.
– Security: need-to-know information sharing

Page 5: Sovereign Information Sharing and Mining in a Connected World

Security Application

Security Agency finds those passengers who are in its list of suspects, but not the names of other passengers.

Airline does not find anything.

[Figure: Agency (Suspect List) and Airline (Passenger List).]

http://www.informationweek.com/story/showArticle.jhtml?articleID=184010%79

Page 6: Sovereign Information Sharing and Mining in a Connected World

Epidemiological Research

Validate hypothesis between adverse reaction to a drug and a specific DNA sequence.

Researcher should not learn anything beyond 4 counts:

                    Adverse Reaction    No Adv. Reaction
Sequence Present    ?                   ?
Sequence Absent     ?                   ?

[Figure: Medical Research Inst. querying two databases, DNA Sequences and Drug Reactions.]
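To pin down what the four counts are, here is a minimal Python sketch that computes them in the clear. It is not a privacy-preserving protocol: a sovereign-sharing protocol would aim to deliver the same four numbers without either database owner seeing the other's rows. The function name, the dictionary layout, and the sample data are all hypothetical.

```python
from collections import Counter

def four_counts(has_sequence, had_reaction):
    """Return the 2x2 contingency counts for patients present in both inputs.

    has_sequence: dict patient_id -> bool (hypothetical view of the DNA database)
    had_reaction: dict patient_id -> bool (hypothetical view of the drug reactions database)
    """
    counts = Counter()
    # Only patients known to both databases contribute to the table.
    for pid in has_sequence.keys() & had_reaction.keys():
        counts[(has_sequence[pid], had_reaction[pid])] += 1
    return {
        "present_adverse": counts[(True, True)],
        "present_no_adverse": counts[(True, False)],
        "absent_adverse": counts[(False, True)],
        "absent_no_adverse": counts[(False, False)],
    }

# Made-up sample data, for illustration only.
dna = {1: True, 2: True, 3: False, 4: False, 5: True, 7: True}
reactions = {1: True, 2: False, 3: False, 4: True, 6: True, 7: True}
table = four_counts(dna, reactions)
```

Patients 5 and 6 appear in only one database and are therefore not counted.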

Page 7: Sovereign Information Sharing and Mining in a Connected World

Minimal Necessary Sharing

R = {a, u, v, x}
S = {b, u, v, y}

R ∩ S = {u, v}
– R must not know that S has b & y
– S must not know that R has a & x

Count (R ∩ S)
– R & S do not learn anything except that the result is 2.

Page 8: Sovereign Information Sharing and Mining in a Connected World

Problem Statement: Minimal Sharing

Given:
– Two parties (honest-but-curious): R (receiver) and S (sender)
– Query Q spanning the tables R and S
– Additional (pre-specified) categories of information I

Compute the answer to Q and return it to R without revealing any additional information to either party, except for the information contained in I
– For example, in the upcoming intersection protocols I = { |R| , |S| }

Page 9: Sovereign Information Sharing and Mining in a Connected World

A Possible Approach

Secure Multi-Party Computation
– Given two parties with inputs x and y, compute f(x,y) such that the parties learn only f(x,y) and nothing else.
– Can be solved by building a combinatorial circuit, and simulating that circuit [Yao86].

Prohibitive cost for database-size problems.
– Intersection of two relations of a million records each would require 144 days (Yao's protocol)

Page 10: Sovereign Information Sharing and Mining in a Connected World

Intersection Protocol

[Figure: R holds relation R and secret key a; S holds relation S and secret key b. S computes f_b(S).]

f_b(S) is shorthand for { f_b(s) | s ∈ S }

Commutative encryption: f_a(f_b(s)) = f_b(f_a(s))

f(s, b, p) = s^b mod p
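The commutative property of modular exponentiation can be checked directly. The sketch below is a toy illustration in Python: the modulus `P`, the helper `keygen`, and the sample value are assumptions made here, not parameters from the talk, and nothing about it should be read as a secure configuration.

```python
import math
import secrets

# Assumption: a known Mersenne prime used as an illustrative modulus.
P = 2**127 - 1

def keygen(p=P):
    """Pick a random exponent coprime to p - 1, so that x -> x^k mod p
    is a bijection on 1..p-1 (hypothetical helper)."""
    while True:
        k = secrets.randbelow(p - 3) + 2  # uniform in 2..p-2
        if math.gcd(k, p - 1) == 1:
            return k

def f(s, k, p=P):
    """f(s, k, p) = s^k mod p, as written on the slide."""
    return pow(s, k, p)

a, b = keygen(), keygen()   # secret keys of R and S
s = 123456789               # some value already encoded as a number in 1..p-1

# Encrypting with a then b gives the same result as b then a,
# because (s^a)^b = (s^b)^a = s^(a*b) mod p.
assert f(f(s, a), b) == f(f(s, b), a)
```

Python's built-in three-argument `pow` performs the modular exponentiation without materializing the huge intermediate power.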

Page 11: Sovereign Information Sharing and Mining in a Connected World

Intersection Protocol

[Figure: S (secret key b) sends f_b(S) to R (secret key a). R computes f_a(f_b(S)), which equals f_b(f_a(S)) by the commutative property.]

Page 12: Sovereign Information Sharing and Mining in a Connected World

Intersection Protocol

[Figure: R (secret key a) sends f_a(R) to S (secret key b). S returns the pairs {<f_a(r), f_b(f_a(r))>}. Since R knows <r, f_a(r)>, it obtains <r, f_b(f_a(r))> and compares these against f_b(f_a(S)) to find R ∩ S.]
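The message flow on the intersection protocol slides can be sketched end to end. The Python below simulates both parties in a single process, so it shows the data flow only and provides no isolation between R and S. The modulus `P`, the hashing step `to_group`, and all function names are assumptions made for this sketch, not details taken from the talk.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # assumption: illustrative prime modulus, not a recommended parameter

def keygen(p=P):
    """Random exponent coprime to p - 1, so encryption is a bijection on 1..p-1."""
    while True:
        k = secrets.randbelow(p - 3) + 2
        if math.gcd(k, p - 1) == 1:
            return k

def to_group(value, p=P):
    """Hypothetical encoding step: hash a string into 1..p-1."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (p - 1) + 1

def enc(x, k, p=P):
    """Commutative encryption f_k(x) = x^k mod p."""
    return pow(x, k, p)

def intersect(r_values, s_values):
    """Return what R learns: the values of R that also occur in S."""
    a, b = keygen(), keygen()  # secret keys of R and S

    # S -> R: f_b(S), sorted so that ordering reveals nothing about S.
    fb_S = sorted(enc(to_group(s), b) for s in set(s_values))

    # R -> S: f_a(R). R remembers the pairs <r, f_a(r)>.
    fa_R = {r: enc(to_group(r), a) for r in set(r_values)}
    msg_to_S = sorted(fa_R.values())

    # S -> R: the pairs <f_a(r), f_b(f_a(r))>.
    pairs = {y: enc(y, b) for y in msg_to_S}

    # R: compute f_a(f_b(S)) = f_b(f_a(S)) and match doubly encrypted values.
    fab_S = {enc(z, a) for z in fb_S}
    return {r for r, y in fa_R.items() if pairs[y] in fab_S}

common = intersect(["a", "u", "v", "x"], ["b", "u", "v", "y"])
```

With the example relations from the Minimal Necessary Sharing slide, `common` contains `u` and `v`. In this flow the message sizes also reveal |R| and |S|, which matches the category I = { |R| , |S| } in the problem statement.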

Page 13: Sovereign Information Sharing and Mining in a Connected World

Related Work

[Naor & Pinkas 99]: Two protocols for list intersection problem
– Oblivious evaluation of n polynomials of degree n each.
– Oblivious evaluation of n^2 linear polynomials.

[Huberman et al 99]: find people with common preferences, without revealing the preferences.
– Intersection protocols are similar

[Clifton et al, 03]: Secure set union and set intersection
– Similar protocols

Page 14: Sovereign Information Sharing and Mining in a Connected World

Implementation: Grid of Data Services

[Figure: a User and an Application Developer work with an Application built on the SIS Client (with Client Metadata); the SIS Client talks to SIS Servers 1..n, each located at a Data Provider alongside a DP DB Server and its metadata. Together these form the SIS Platform.]

SIS Client: Constructs web service query requests against multiple data providers, and collects responses.

Client Metadata: Mapping information and data provider access information.

Application: Thin layer on top of the SIS client: invokes the required SIS operations, provides an interface to a SIS user.

Data provider metadata: Includes view information to retrieve data from the data provider database, database access information, and context information.

SIS Server: Provides the necessary functionality on the data provider side to enable sovereign sharing.

Templates to aid application development

Page 15: Sovereign Information Sharing and Mining in a Connected World

System Issues

How does the application developer find the necessary data sources and their schemas? (resource discovery mechanism)
• Employ a UDDI registry to store and search
– data providers and operations they support
– available schemas for each data provider

How does the application developer link the data between different providers? (schema mapping mechanism)
• Data providers publish schemas in their own vocabularies.
• Developers link the schemas.

How to ensure that only eligible users can carry out the computation? (authentication mechanism)
• Authentication across multiple domains

Page 16: Sovereign Information Sharing and Mining in a Connected World

Implementation Environment

Data resides in DB2 v8.1 database systems, installed on 2.4 GHz / 512 MB RAM Intel workstations, connected by a 100 Mbit LAN.

Web services run on top of the IBM WebSphere Application Server v5.0 and use the Apache AXIS v1.1 SOAP library for messaging.

IBM private UDDI registry installed on one of the machines.

Page 17: Sovereign Information Sharing and Mining in a Connected World

Performance

Exponentiation time for one number (Intel P3)

Implementation                      ms
Java program                        32
Java DB2 UDF                        33-34
MS Visual C++ (Crypto++ library)    65

Page 18: Sovereign Information Sharing and Mining in a Connected World

Making Encryption Faster: Software Approaches

The main component of encryption is exponentiation: enc(x, k, p) = x^k mod p

Tried custom implementations of exponentiation that used preprocessing based on
– fixed exponent (k)
– fixed base (x)

Fixed exponent implementation turned out to be slower than the Java native implementation

Fixed base is beneficial if the same value is encrypted multiple times with different keys (not useful for intersection where each value is encrypted once)
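To illustrate why a fixed base only pays off when the same value is encrypted under many keys, here is a minimal Python sketch of one generic fixed-base technique: precompute the repeated squares of the base once, then reuse them for every exponent. This is an assumed textbook variant, not necessarily the preprocessing the authors tried, and the function names are hypothetical.

```python
def precompute(base, p, max_bits):
    """Table of base^(2^i) mod p for i = 0 .. max_bits - 1 (one-time cost per base)."""
    table = [base % p]
    for _ in range(max_bits - 1):
        table.append(table[-1] * table[-1] % p)
    return table

def fixed_base_pow(table, k, p):
    """Compute base^k mod p using only multiplications of precomputed squares."""
    if k < 0 or k.bit_length() > len(table):
        raise ValueError("exponent out of range for this table")
    result = 1 % p
    i = 0
    while k:
        if k & 1:                       # bit i of the exponent is set
            result = result * table[i] % p
        k >>= 1
        i += 1
    return result

p = 1_000_003
table = precompute(7, p, max_bits=20)   # paid once for the base 7
# Each further key k reuses the table and skips the squaring steps.
values = [fixed_base_pow(table, k, p) for k in (5, 1234, 99_999)]
```

In the intersection setting each value is a different base that is encrypted once, so the table would be built and used a single time and nothing is saved.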

Page 19: Sovereign Information Sharing and Mining in a Connected World

Making Encryption Faster: Hardware Accelerator

Use SSL card to speed up exponentiation

Multiple threads (100+) must post exponentiation requests simultaneously to the card API to get the advertised speed-up

AEP scheduler distributes exponentiation requests between multiple cards automatically; linear speed-up

Example: AEP SSL CARD Runner 2000 ≈ $2k

Page 20: Sovereign Information Sharing and Mining in a Connected World

Execution time: Encryption UDF

Encryption Engine        Number of rows in the table
                         1,000     5,000     10,000
CPU Intel III 2.0 GHz    34 s      175 s     320 s
AEP Runner 2000          3.5 s     19 s      37 s

Page 21: Sovereign Information Sharing and Mining in a Connected World

Application Performance

Encryption speed is 20K encryptions per minute using one accelerator card ($2K per card)

Airline application: 150,000 (daily) passengers and 1 million people in the watch list:
– 120 minutes with one accelerator card
– 12 minutes with ten accelerator cards

Epidemiological research: 1 million patient records in the hospital and 10 million records in the Genebank:
– 37 hours with one accelerator card
– 3.7 hours with ten accelerator cards
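A back-of-envelope check of these figures, as a short Python calculation. The only number taken from the slide is the rate of 20K encryptions per minute per card. The assumption that each value is encrypted twice (once under each party's key, as in the intersection protocol) is made here and is not stated on the slide.

```python
RATE_PER_CARD = 20_000  # encryptions per minute per card, from the slide

def minutes_needed(num_encryptions, cards=1):
    """Time in minutes, assuming linear speed-up across cards."""
    return num_encryptions / (RATE_PER_CARD * cards)

def implied_encryptions_per_record(reported_minutes, records, cards=1):
    """How many encryptions per record a reported running time corresponds to."""
    return reported_minutes * RATE_PER_CARD * cards / records

# Airline: assumption of 2 encryptions per value (one per key).
airline_records = 150_000 + 1_000_000
airline_one_card = minutes_needed(2 * airline_records)             # 115.0 minutes
airline_ten_cards = minutes_needed(2 * airline_records, cards=10)  # 11.5 minutes

# Epidemiological research: work backwards from the reported 37 hours.
epi_records = 1_000_000 + 10_000_000
epi_ratio = implied_encryptions_per_record(37 * 60, epi_records)   # about 4.04
```

Under that assumption the airline estimate comes to about 115 minutes on one card, in line with the 120 minutes on the slide. The 37 hours reported for the epidemiological case corresponds to roughly four encryptions per record. The slide does not explain the difference, so that ratio is only an inference from the reported numbers.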

Page 22: Sovereign Information Sharing and Mining in a Connected World

Current Work

Use of secure coprocessors to address
– Richer join operations
– Performance
– Semi-dishonesty

Incentive compatibility and auditing to address maliciousness

[Figure: IBM 4764 cryptographic coprocessor]

Page 23: Sovereign Information Sharing and Mining in a Connected World

Privacy Preserving Data Mining: The Randomization Approach

To hide original values x_1, x_2, ..., x_n
– from probability distribution X (unknown)
we use y_1, y_2, ..., y_n
– from probability distribution Y

Problem: Given
– x_1+y_1, x_2+y_2, ..., x_n+y_n
– the probability distribution of Y
Estimate the probability distribution of X.

Use the estimated distribution of X to build the classification model

Extended subsequently to mining association rules while preserving the privacy of individual transactions

R. Agrawal, R. Srikant. Privacy Preserving Data Mining. SIGMOD 00.

A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Privacy Preserving Mining of Association Rules. SIGKDD 02.
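The estimation problem can be illustrated with an iterative Bayesian update. The sketch below is a simplified discrete version written for this transcript: X is assumed to take values in a small finite domain and Y is assumed to have a known discrete distribution. It follows the spirit of iterative reconstruction but should not be read as the exact procedure of the cited papers. All names and the toy data are hypothetical.

```python
def reconstruct(observed, noise_pmf, domain, iterations=100):
    """Estimate P(X = a) for a in domain from perturbed values w_i = x_i + y_i.

    observed:  list of perturbed values w_i
    noise_pmf: dict y -> P(Y = y), the known distribution of the noise
    domain:    candidate values of X
    """
    domain = list(domain)
    est = {a: 1.0 / len(domain) for a in domain}   # start from a uniform guess
    for _ in range(iterations):
        new = {a: 0.0 for a in domain}
        used = 0
        for w in observed:
            # Posterior weight of X = a given w, under the current estimate.
            weights = {a: noise_pmf.get(w - a, 0.0) * est[a] for a in domain}
            total = sum(weights.values())
            if total == 0.0:
                continue   # w cannot arise from this domain and noise; skip it
            used += 1
            for a in domain:
                new[a] += weights[a] / total
        if used == 0:
            raise ValueError("no observation is consistent with the model")
        est = {a: new[a] / used for a in domain}   # average posterior becomes the new estimate
    return est

# Toy data: every original value is 5, noise is uniform on -2..2,
# and each noise value occurs equally often.
noise = {y: 0.2 for y in range(-2, 3)}
observed = [3, 4, 5, 6, 7] * 20
estimate = reconstruct(observed, noise, range(10))
```

On this toy input the estimate concentrates on the value 5, even though no single perturbed value identifies an individual's original value. Each update uses only the perturbed values and the noise distribution, never the originals.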

Page 24: Sovereign Information Sharing and Mining in a Connected World

Distributed Setting

Application scenario: A central server interested in building a data mining model using data obtained from a large number of clients, while preserving their privacy
– Web-commerce, e.g. recommendation service

Desiderata:
– Must not slow down the speed of client interaction
– Must scale to very large number of clients

During the application phase
– Ship model to the clients
– Use oblivious computations

Implication:
– Action taken to preserve privacy of a record must not depend on other records
– Fast, per-transaction perturbation (potential loss in accuracy)

Page 25: Sovereign Information Sharing and Mining in a Connected World

Inter-Enterprise Setting

A party has access to all the records in its database
– Considerable increase in available options

Cryptographic approaches
– Lindell & Pinkas [Crypto 2000]
– Purdue Toolkit [Clifton et al 2003]

Global approaches (e.g. swapping) from SDC

Model combination and Voting
– Potential for leakage from individual models

Tradeoff between Generality, Performance, Accuracy, and Potential disclosure: Not well understood

Page 26: Sovereign Information Sharing and Mining in a Connected World

Outlook

Three stages of Network era*
– Brochure stage (informational websites)
– Transaction stage (e-commerce, online banking, etc.)
– E-business on demand (integrate business processes within and with external parties; dynamic virtual organizations)

The on demand era is presenting research opportunities for discontinuous thinking

Sovereign information sharing is one such key opportunity, but challenges abound:
– Fast, scalable, and composable protocols
– New framework for thinking about ownership, privacy, and security (zero-leakage model does not scale)

*IBM. Living in an On Demand World. October 2002.