needa-multani
Data mining is…
• Data mining is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both.
• A user-centric, interactive process that leverages analysis technologies and computing power
• A group of techniques that find relationships that have not previously been discovered
• Not reliant on an existing database
• A relatively easy task that requires knowledge of the business problem and subject matter expertise
Data mining is not
• “Blind” application of algorithms
• Going to find relationships where none exist
• Presenting data in different ways
• A database-intensive task
• A difficult-to-understand technology requiring an advanced degree in computer science
The Evolution of Data Analysis
• Evolutionary Step: Data Collection (1960s)
  – Business Question: "What was my total revenue in the last five years?"
  – Enabling Technologies: Computers, tapes, disks
  – Product Providers: IBM, CDC
  – Characteristics: Retrospective, static data delivery
• Evolutionary Step: Data Access (1980s)
  – Business Question: "What were unit sales in New England last March?"
  – Enabling Technologies: Relational databases (RDBMS), Structured Query Language (SQL), ODBC
  – Product Providers: Oracle, Sybase, Informix, IBM, Microsoft
  – Characteristics: Retrospective, dynamic data delivery at record level
• Evolutionary Step: Data Warehousing & Decision Support (1990s)
  – Business Question: "What were unit sales in New England last March? Drill down to Boston."
  – Enabling Technologies: On-line analytic processing (OLAP), multidimensional databases, data warehouses
  – Product Providers: SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR
  – Characteristics: Retrospective, dynamic data delivery at multiple levels
• Evolutionary Step: Data Mining (Emerging Today)
  – Business Question: "What’s likely to happen to Boston unit sales next month? Why?"
  – Enabling Technologies: Advanced algorithms, multiprocessor computers, massive databases
  – Product Providers: SPSS/Clementine, Lockheed, IBM, SGI, SAS, NCR, Oracle, numerous startups
  – Characteristics: Prospective, proactive information delivery
Results for data mining…
• Forecasting what may happen in the future
• Classifying people or things into groups by recognizing patterns
• Clustering people or things into groups based on their attributes
• Associating what events are likely to occur together
• Sequencing what events are likely to lead to later events
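As one concrete illustration of the "associating" result above, the sketch below computes the two measures most often used to judge whether events occur together: support and confidence. The transactions and the function names are made up for illustration and do not come from the slides.

```python
# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# How often bread and milk appear together, and how often milk
# appears given that bread was bought.
print(support({"bread", "milk"}, transactions))
print(confidence({"bread"}, {"milk"}, transactions))
```

A rule such as "bread implies milk" is usually kept only if both values clear thresholds chosen by the analyst, which is where knowledge of the business problem comes in.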
Why Should There be a Standard Process?
• Framework for recording experience
  – Allows projects to be replicated
• Aid to project planning and management
• “Comfort factor” for new adopters
  – Demonstrates maturity of Data Mining
  – Reduces dependency on “stars”
The data mining process must be reliable and repeatable by people with little data mining background.
Data mining vs OLAP
OLAP - On-line Analytical Processing
OLAP provides you with a very good view of what is happening, but it cannot predict what will happen in the future or explain why it is happening.
Data mining vs statistical analysis
• Data Mining
  – Originally developed to act as expert systems to solve problems
  – Less interested in the mechanics of the technique
  – If it makes sense then let’s use it
  – Does not require assumptions to be made about data
  – Can find patterns in very large amounts of data
  – Requires understanding of data and business problem
• Data Analysis
  – Tests for statistical correctness of models
    • Are statistical assumptions of models correct?
      – E.g., is the R-Square good?
  – Hypothesis testing
    • Is the relationship significant?
      – Use a t-test to validate significance
  – Tends to rely on sampling
  – Techniques are not optimised for large amounts of data
  – Requires strong statistical skills
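The two checks named under Data Analysis above (R-Square and a t-test for significance) can be sketched for the simplest case, a one-variable least-squares line. This is a minimal sketch using only the standard library; the function name and the sample data are assumptions made for illustration.

```python
import math

def simple_linear_fit(x, y):
    """Least-squares line y = intercept + slope * x.

    Returns (slope, intercept, r_squared, t_stat), where t_stat is the
    t-statistic for the hypothesis that the true slope is zero.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares give R-Square.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1 - sse / sst
    # Standard error of the slope, with n - 2 degrees of freedom.
    std_err = math.sqrt(sse / (n - 2) / sxx)
    t_stat = slope / std_err
    return slope, intercept, r_squared, t_stat

# Made-up observations that lie close to a straight line.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept, r_squared, t_stat = simple_linear_fit(x, y)
print(slope, intercept, r_squared, t_stat)
```

The t-statistic is then compared against a Student t distribution with n - 2 degrees of freedom; a large absolute value suggests the relationship is unlikely to be due to chance.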
Examples of what people are doing with data mining
• Fraud/Non-Compliance Anomaly detection
  – Isolate the factors that lead to fraud, waste and abuse
  – Target auditing and investigative efforts more effectively
• Credit/Risk Scoring
• Intrusion detection
• Parts failure prediction
• Recruiting/Attracting customers
• Maximizing profitability (cross selling, identifying profitable customers)
• Service Delivery and Customer Retention
  – Build profiles of customers likely to use which services
• Web Mining
• Data mining is extensively used for knowledge discovery from large databases.
• The problem with data mining is that with the availability of non-sensitive information, one is able to infer sensitive information that is not to be disclosed.
• Thus privacy is becoming an increasingly important issue in many data mining applications.
• This has led to the development of privacy preserving data mining.
What is Data Mining and Privacy Preservation all about:
• Perform data mining on the union of two private databases
• Data stays private, i.e. no party learns anything but the output
Objective
• Meeting privacy requirements: the study of achieving some data mining goals without sacrificing the privacy of the individuals
• Providing valid data mining results
How Do We Do It?
There are two approaches to preserve privacy:
• The first approach protects the privacy of the data by using an extended role-based access control approach in which sensitive object identification is used to protect an individual’s privacy.
• The second approach uses cryptographic techniques.
Cryptographic techniques for PPDM
• Goal: the parties run the data mining algorithm on the union of their databases without revealing any unnecessary information.
• Example: consider separate medical institutions that wish to conduct joint research while preserving the privacy of their patients.
• Protection of privileged information, along with its use for research.
How much privacy?
• If a data mining algorithm is run against the union of the databases, and its output becomes known to one or more of the parties, it reveals something about the contents of the other databases.
• This leak of information is inevitable, however, if the parties need to learn the output.
What is Cryptography?
• The common definition of privacy in the cryptographic community limits the information that is leaked by the distributed computation to the information that can be learned from the designated output of the computation.
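The definition above can be made concrete with a secure sum, a well-known building block that the slides do not name; it is used here only as an illustration. Each party splits its private value into random additive shares, so that the only thing anyone learns is the designated output (the total). The sketch simulates all parties in one process and assumes they follow the protocol honestly; `MODULUS`, `make_shares` and `secure_sum` are hypothetical names.

```python
import secrets

# Assumed public modulus; it must be larger than any possible total.
MODULUS = 2**61 - 1

def make_shares(value, n_parties, modulus=MODULUS):
    """Split `value` into n_parties random shares that sum to it mod `modulus`."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % modulus
    return shares + [last]

def secure_sum(private_values, modulus=MODULUS):
    """Simulate an additive-secret-sharing sum over all parties' values."""
    n = len(private_values)
    # Party i splits its value; share j would be sent to party j.
    all_shares = [make_shares(v, n, modulus) for v in private_values]
    # Party j adds the shares it received and publishes only that partial sum.
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % modulus for j in range(n)
    ]
    # The published partial sums reveal the total and nothing else.
    return sum(partial_sums) % modulus

# Three hypothetical institutions with private patient counts.
print(secure_sum([120, 75, 310]))
```

Each individual share is uniformly random, so a party that sees one share learns nothing about the value it came from. As the "How much privacy?" slide notes, the total itself still reveals something about the other databases, and that leak is accepted because the total is the output the parties asked for.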
Specific data mining: Applications
What data mining has done for…
• The US Internal Revenue Service needed to improve customer service and...
• Scheduled its workforce to provide faster, more accurate answers to questions
What data mining has done for…
• The US Drug Enforcement Administration needed to be more effective in its drug “busts” and...
• Analyzed suspects’ cell phone usage to focus investigations.
What data mining has done for…
• HSBC needed to cross-sell more effectively by identifying profiles that would be interested in higher-yielding investments and...
• Reduced direct mail costs by 30% while garnering 95% of the campaign’s revenue.
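A trade-off of this kind (mailing fewer people while keeping most of the revenue) is typically read off a cumulative gains calculation: rank customers by model score, mail only the top fraction, and measure the share of revenue retained. The sketch below uses invented scores and revenues chosen purely for illustration; it is not HSBC's data or method, and `gains_at_depth` is a hypothetical helper.

```python
def gains_at_depth(scored_customers, depth):
    """Share of total revenue captured by mailing the top `depth` fraction.

    scored_customers: list of (model_score, revenue) pairs.
    depth: fraction of customers to mail, between 0 and 1.
    """
    # Highest-scoring customers first.
    ranked = sorted(scored_customers, key=lambda c: c[0], reverse=True)
    n_mailed = round(len(ranked) * depth)
    total = sum(revenue for _, revenue in ranked)
    captured = sum(revenue for _, revenue in ranked[:n_mailed])
    return captured / total

# Hypothetical (score, revenue) pairs for ten customers.
customers = [
    (0.90, 40), (0.80, 30), (0.70, 0), (0.60, 20), (0.50, 0),
    (0.40, 5), (0.30, 0), (0.20, 5), (0.10, 0), (0.05, 0),
]

# Mailing only the top 70% of the list cuts mail volume by 30%.
print(gains_at_depth(customers, 0.7))
```

With these invented numbers, dropping the lowest-scoring 30% of the list gives up only the revenue of one low-value customer, which is the pattern the slide describes.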
Final comments
• Data Mining can be utilized in any organization that needs to find patterns or relationships in its data.
• By using the CRISP-DM methodology, analysts can have a reasonable level of assurance that their Data Mining efforts will render useful, repeatable, and valid results.
5 dimensions of PPDM
(1) the distribution of the basic data,
(2) how basic data are modified,
(3) which mining method is being used,
(4) whether basic data or rules are to be hidden, and
(5) which additional methods for privacy preservation are used
Privacy guidelines
1. Collection limitation principle – too general to be enforced in PPDM
2. Data quality principle – most of today’s PPDM methods or algorithms assume that data are already prepared to an appropriate quality to be mined
3. Purpose specification principle – extremely relevant for PPDM
4. Use limitation principle – fundamental for PPDM
5. Security safeguard principle – unenforceable in the context of PPDM
6. Openness principle – relevant for PPDM
7. Individual participation principle – Oliveira and Zaïane suggest that the implications of this principle for PPDM should be carefully weighed in light of the ownership of the basic data; otherwise its application could be too rigid in PPDM.
8. Accountability principle – too general for PPDM