
Data mining and privacy preserving in data mining


Page 1: Data mining and privacy preserving in data mining

Data mining is…

• Data mining is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both.

• A user-centric, interactive process that leverages analysis technologies and computing power

• A group of techniques that find relationships that have not previously been discovered

• Not reliant on an existing database

• A relatively easy task that requires knowledge of the business problem and subject matter expertise

Page 2: Data mining and privacy preserving in data mining

Data mining is not

• “Blind” application of algorithms

• Going to find relationships where none exist

• Presenting data in different ways

• A database intensive task

• A difficult to understand technology requiring an advanced degree in computer science

Page 3: Data mining and privacy preserving in data mining

The Evolution of Data Analysis

| Evolutionary Step | Business Question | Enabling Technologies | Product Providers | Characteristics |
|---|---|---|---|---|
| Data Collection (1960s) | "What was my total revenue in the last five years?" | Computers, tapes, disks | IBM, CDC | Retrospective, static data delivery |
| Data Access (1980s) | "What were unit sales in New England last March?" | Relational databases (RDBMS), Structured Query Language (SQL), ODBC | Oracle, Sybase, Informix, IBM, Microsoft | Retrospective, dynamic data delivery at record level |
| Data Warehousing & Decision Support (1990s) | "What were unit sales in New England last March? Drill down to Boston." | On-line analytic processing (OLAP), multidimensional databases, data warehouses | SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR | Retrospective, dynamic data delivery at multiple levels |
| Data Mining (Emerging Today) | "What’s likely to happen to Boston unit sales next month? Why?" | Advanced algorithms, multiprocessor computers, massive databases | SPSS/Clementine, Lockheed, IBM, SGI, SAS, NCR, Oracle, numerous startups | Prospective, proactive information delivery |

Page 4: Data mining and privacy preserving in data mining

Results for data mining…

• Forecasting what may happen in the future

• Classifying people or things into groups by recognizing patterns

• Clustering people or things into groups based on their attributes

• Associating what events are likely to occur together

• Sequencing what events are likely to lead to later events
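The "associating" result can be made concrete with two common association measures, support and confidence. The sketch below is illustrative only: the function names and the basket data are invented for this example and do not come from the slides.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule is unsupported
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base


# Hypothetical market baskets, invented for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(baskets, {"bread", "milk"}))        # 2 of 4 baskets
print(confidence(baskets, {"bread"}, {"milk"}))   # 2 of the 3 bread baskets
```

A rule such as "bread implies milk" is usually kept only when both measures exceed thresholds chosen by the analyst.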

Page 5: Data mining and privacy preserving in data mining

Why Should There be a Standard Process?

• Framework for recording experience

– Allows projects to be replicated

• Aid to project planning and management

• “Comfort factor” for new adopters

– Demonstrates maturity of Data Mining

– Reduces dependency on “stars”

The data mining process must be reliable and repeatable by people with little data mining background.

Page 6: Data mining and privacy preserving in data mining

Data mining vs OLAP

OLAP - On-line Analytical Processing

Provides you with a very good view of what is happening, but cannot predict what will happen in the future or explain why it is happening

Page 7: Data mining and privacy preserving in data mining

Data mining vs statistical analysis

• Data Mining

– Originally developed to act as expert systems to solve problems

– Less interested in the mechanics of the technique

– If it makes sense then let’s use it

– Does not require assumptions to be made about data

– Can find patterns in very large amounts of data

– Requires understanding of data and business problem

• Data Analysis

– Tests for statistical correctness of models

  • Are statistical assumptions of models correct?

    – E.g. is the R-square good?

– Hypothesis testing

  • Is the relationship significant?

    – Use a t-test to validate significance

– Tends to rely on sampling

– Techniques are not optimised for large amounts of data

– Requires strong statistical skills
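The two statistical checks named on this slide, R-square and the t-test, can be computed by hand for a simple linear regression. The sketch below assumes a single predictor. The function name and the sample data are invented for illustration, and it returns only the t statistic (with n - 2 degrees of freedom), not a p-value.

```python
import math


def r_squared_and_t(xs, ys):
    """Return (R^2, t statistic of the slope) for a one-predictor linear fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    r2 = (sxy * sxy) / (sxx * syy)
    if r2 >= 1.0:
        # A perfect fit has no residual variance, so t is unbounded.
        return 1.0, math.copysign(math.inf, sxy)

    # t = r * sqrt((n - 2) / (1 - r^2)), carrying the sign of the slope.
    t = math.copysign(math.sqrt(r2 * (n - 2) / (1.0 - r2)), sxy)
    return r2, t


# Hypothetical observations, invented for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r2, t = r_squared_and_t(xs, ys)
print(f"R^2 = {r2:.3f}, t = {t:.3f} with {len(xs) - 2} degrees of freedom")
```

Whether the relationship counts as significant depends on comparing t against the critical value for the chosen significance level, which is omitted here.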

Page 8: Data mining and privacy preserving in data mining

Examples of what people are doing with data mining

• Fraud/Non-Compliance Anomaly detection

– Isolate the factors that lead to fraud, waste and abuse

– Target auditing and investigative efforts more effectively

• Credit/Risk Scoring

• Intrusion detection

• Parts failure prediction

• Recruiting/Attracting customers

• Maximizing profitability (cross selling, identifying profitable customers)

• Service Delivery and Customer Retention

– Build profiles of customers likely to use which services

• Web Mining

Page 10: Data mining and privacy preserving in data mining

What is Data Mining and Privacy Preservation all about:

• Data mining is extensively used for knowledge discovery from large databases.

• The problem with data mining is that with the availability of non-sensitive information, one is able to infer sensitive information that is not to be disclosed.

• Thus privacy is becoming an increasingly important issue in many data mining applications.

• This has led to the development of privacy preserving data mining.

Page 11: Data mining and privacy preserving in data mining

Objective

• Perform data mining on the union of two private databases

• Data stays private, i.e. no party learns anything but the output

• Meeting privacy requirements: the study of achieving some data mining goals without sacrificing the privacy of the individuals

• Providing valid data mining results

Page 12: Data mining and privacy preserving in data mining

How Do We Do It?

There are two approaches to preserve privacy:

• The first approach protects the privacy of the data by using an extended role based access control approach where sensitive objects identification is used to protect an individual’s privacy.

• The second approach uses cryptographic techniques.
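The first approach can be pictured as a role check combined with a list of fields marked sensitive. The sketch below is a minimal illustration under that assumption: the role names, field names, and the `view_record` helper are hypothetical and are not taken from the slides or from any particular access control system.

```python
# Hypothetical set of fields identified as sensitive objects.
SENSITIVE_FIELDS = {"diagnosis", "national_id"}

# Hypothetical role table: may this role read sensitive fields?
ROLE_MAY_READ_SENSITIVE = {
    "physician": True,
    "analyst": False,
}


def view_record(record, role):
    """Return the part of `record` that `role` is allowed to see."""
    if role not in ROLE_MAY_READ_SENSITIVE:
        raise PermissionError(f"unknown role: {role}")
    if ROLE_MAY_READ_SENSITIVE[role]:
        return dict(record)
    # Roles without clearance get a copy with the sensitive fields removed.
    return {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}


# Invented record for illustration.
record = {"age": 47, "region": "north", "diagnosis": "asthma", "national_id": "X123"}

print(view_record(record, "analyst"))    # age and region only
print(view_record(record, "physician"))  # the full record
```

Dropping fields alone does not prevent inference from the remaining attributes, which is one motivation for the second, cryptographic approach.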

Page 13: Data mining and privacy preserving in data mining

Cryptographic techniques for PPDM

• The goal is to run the data mining algorithm on the union of the parties' databases without revealing any unnecessary information.

• Consider separate medical institutions that wish to conduct joint research while preserving the privacy of their patients.

• Protection of privileged information, along with its use for research.
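One well-known building block for this setting is a secure sum based on additive secret sharing. The sketch below simulates it inside a single process, assuming three institutions that follow the protocol honestly and do not collude. The modulus, function names, and patient counts are invented for illustration, and a real deployment would need authenticated network channels between the parties.

```python
import secrets

# Public modulus agreed by all parties; it must exceed the true total.
MODULUS = 2 ** 61 - 1


def make_shares(value, n_parties):
    """Split `value` into n random-looking shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def secure_sum(private_values):
    """Simulate the protocol: no single share or partial sum reveals an input."""
    n = len(private_values)
    # shares[i][j] is the share that party i sends to party j.
    shares = [make_shares(v, n) for v in private_values]
    # Each party j publishes only the sum of the shares it received.
    partials = [sum(shares[i][j] for i in range(n)) % MODULUS for j in range(n)]
    return sum(partials) % MODULUS


# Hypothetical patient counts held by three institutions.
counts = [120, 75, 310]
print(secure_sum(counts))  # equals sum(counts) as long as the total is below MODULUS
```

Each published partial sum mixes random shares from every party, so on its own it says nothing about any one institution's count.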

Page 14: Data mining and privacy preserving in data mining

How much privacy?

• If a data mining algorithm is run against the union of the databases, and its output becomes known to one or more of the parties, it reveals something about the contents of the other databases.

• This leak of information is inevitable, however, if the parties need to learn this output.
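A small arithmetic illustration of this unavoidable leak, with invented numbers: when the agreed output is a joint total, any party can subtract its own contribution and learn the combined input of everyone else. The helper name below is hypothetical.

```python
def what_others_hold(joint_output, own_input):
    """Combined input of the other parties, as deduced by one party."""
    return joint_output - own_input


# Two parties: the output pins down the other party's value exactly.
hospital_a, hospital_b = 40, 25
joint_two = hospital_a + hospital_b
print(what_others_hold(joint_two, hospital_a))    # 25, which is hospital B's count

# Three parties: only the sum of the other two is revealed, not the split.
hospital_c = 10
joint_three = hospital_a + hospital_b + hospital_c
print(what_others_hold(joint_three, hospital_a))  # 35, which is B and C together
```

No protocol can hide this, because the information follows from the output itself rather than from how the output was computed.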

Page 15: Data mining and privacy preserving in data mining

What is Cryptography?

• The common definition of privacy in the cryptographic community limits the information that is leaked by the distributed computation to be the information that can be learned from the designated output of the computation.

Page 16: Data mining and privacy preserving in data mining

Specific data mining: Applications

Page 17: Data mining and privacy preserving in data mining

What data mining has done for…

• The US Internal Revenue Service needed to improve customer service and...

• Scheduled its workforce to provide faster, more accurate answers to questions

Page 18: Data mining and privacy preserving in data mining

What data mining has done for…

• The US Drug Enforcement Administration needed to be more effective in its drug “busts” and...

• Analyzed suspects’ cell phone usage to focus investigations.

Page 19: Data mining and privacy preserving in data mining

What data mining has done for…

• HSBC needed to cross-sell more effectively by identifying profiles that would be interested in higher yielding investments and...

• Reduced direct mail costs by 30% while garnering 95% of the campaign’s revenue.

Page 20: Data mining and privacy preserving in data mining

Final comments

• Data Mining can be utilized in any organization that needs to find patterns or relationships in their data.

• By using the CRISP-DM methodology, analysts can have a reasonable level of assurance that their Data Mining efforts will render useful, repeatable, and valid results.

Page 21: Data mining and privacy preserving in data mining

5 dimensions of PPDM

(1) the distribution of the basic data,

(2) how basic data are modified,

(3) which mining method is being used,

(4) whether basic data or rules are to be hidden, and

(5) which additional methods for privacy preservation are used
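As one possible illustration of dimension (2), basic data can be modified by randomization before it is shared. The sketch below uses randomized response, a technique chosen here as an example and not named in the slides: each yes/no answer is reported truthfully only with probability `p_truth`, yet the overall rate can still be estimated. The function names, the 30% rate, and the seed are assumptions made for the example.

```python
import random


def randomize(truth, p_truth, rng):
    """Report the true answer with probability p_truth, otherwise its opposite."""
    return truth if rng.random() < p_truth else not truth


def estimate_rate(reports, p_truth):
    """Recover the underlying 'yes' rate from the randomized reports.

    E[observed] = p * rate + (1 - p) * (1 - rate), solved here for rate.
    Requires p_truth != 0.5, otherwise the reports carry no information.
    """
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth)) / (2 * p_truth - 1)


rng = random.Random(0)                         # fixed seed for repeatability
true_answers = [i < 3000 for i in range(10000)]  # 30% hold the sensitive trait
reports = [randomize(a, 0.75, rng) for a in true_answers]

print(round(estimate_rate(reports, 0.75), 2))  # close to the true rate of 0.30
```

No individual report can be trusted on its own, which protects each respondent, while the aggregate estimate remains usable for mining.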

Page 22: Data mining and privacy preserving in data mining

Privacy guidelines

1. Collection limitation principle – too general to be enforced in PPDM

2. Data quality principle – most of today’s PPDM methods or algorithms assume that data are already prepared to an appropriate quality to be mined

3. Purpose specification principle – extremely relevant for PPDM

Page 23: Data mining and privacy preserving in data mining

4. Use limitation principle – fundamental for PPDM

5. Security safeguard principle – unenforceable in the context of PPDM

6. Openness principle – relevant for PPDM

Page 24: Data mining and privacy preserving in data mining

7. Individual participation principle – Oliveira and Zaïane suggest that the implications of this principle for PPDM should be carefully weighed in light of the ownership of the basic data; otherwise it could prove too rigid for PPDM applications.

8. Accountability principle – too general for PPDM