codepolitan
Combining Data Mining and Machine Learning for Effective User Profiling
Saturday, 14 May 2016
Wealth of data/information, Lack of knowledge
Databases keep growing larger
• Terrorbytes!
A deluge of data, containing a lot of hidden information
• new knowledge
What are the technological motivations?
• Technologies to collect data
– Bar code readers, scanners, cameras, etc.
• Technologies to store data
– Databases, data warehouses, other repositories
• Network (Web) as computing and storage platform
An example of data deluge:
• the Web and social media!
Why Mine Data? Commercial Viewpoint
Lots of data is being collected and warehoused
• Web data, e-commerce
• Purchases at department/grocery stores
• Bank/credit card transactions
Competitive pressure is strong
• Use data mining to provide better, customized services for an edge (e.g. in Customer Relationship Management)
Why Mine Data? Scientific Viewpoint
Data collected and stored at enormous speeds (GB/hour)
• remote sensors on a satellite
• telescopes scanning the skies
• microarrays generating gene expression data
• scientific simulations generating terabytes of data
Traditional techniques infeasible for raw data
Data mining may help scientists:
• in classifying and segmenting data
• in hypothesis formation
What is Data Mining?
Data mining (many definitions)
• Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Data mining: a misnomer?
• It should be "pattern mining", in analogy to gold mining
Alternative names:
• Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition, statistics, database systems, HPC
• Traditional techniques may be unsuitable due to
1. Enormity of data
2. High dimensionality of data
3. Heterogeneous, distributed nature of data
[Figure: Data Mining at the intersection of Machine Learning/Pattern Recognition, Statistics/AI, Database systems, and High Performance Computing]
KDD is a process
– Data mining is the core of the KDD process
[Figure: KDD process, with stages labeled Databases, Data Integration, Data Warehouse, Cleansing / Selection / Transformation, Data Selection, Task-relevant Data, Data Mining, Pattern Interpretation / Evaluation]
Data Mining: On What Kind of Data?
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
1. Object-oriented and object-relational databases
2. Spatial databases
3. Time-series data and temporal data
4. Text databases and multimedia databases
5. Heterogeneous and legacy databases
6. WWW
Web Mining applies DM to WWW
Data Mining
• Often applied to structured databases
Web mining
• Applied to less structured, dynamic data of huge size
• Not only Web content, but also hyperlinks and access logs
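As a small illustration of mining an access log rather than page content, the sketch below counts successful requests per page and distinct client hosts. It assumes the log uses the Common Log Format; the sample lines, hosts, and paths are invented for the example and are not from the slides.

```python
import re
from collections import Counter

# Hypothetical log lines in Common Log Format; hosts and paths are made up
sample_log = """\
192.0.2.1 - - [14/May/2016:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1024
192.0.2.2 - - [14/May/2016:10:00:05 +0000] "GET /products.html HTTP/1.1" 200 2048
192.0.2.1 - - [14/May/2016:10:00:09 +0000] "GET /products.html HTTP/1.1" 200 2048
192.0.2.3 - - [14/May/2016:10:00:12 +0000] "GET /missing.html HTTP/1.1" 404 512
"""

# host, identity, user, [timestamp], "method path protocol", status, size
request_pattern = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+) [^"]*" (\d{3}) \S+$'
)

page_hits = Counter()   # successful requests per path
visitors = set()        # distinct client hosts seen in the log

for line in sample_log.splitlines():
    match = request_pattern.match(line)
    if match is None:
        continue  # skip lines that do not fit the assumed format
    host, method, path, status = match.groups()
    visitors.add(host)
    if status == "200":
        page_hits[path] += 1

print(page_hits.most_common())
print(len(visitors), "distinct hosts")
```

Real logs are far larger and messier, so a production pipeline would stream the file and handle malformed lines more carefully.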
Web Mining Hierarchy
Why?
Data gathered from both the web and more conventional sources can be used to answer such questions as:
• Marketing - those likely to buy.
• Forecasts - predicting demand.
• Loyalty - those likely to defect.
• Credit - which were the profitable items.
• Fraud - when and where they occur.
Related Terms
• Analytics: discovery and communication of meaningful patterns in data.
• Data mining: process of discovering patterns in large datasets using methods from AI, machine learning, statistics and database systems.
• Predictive analytics: techniques from statistics, machine learning and data mining, used in conjunction with historical and current data to make predictions about the future.
Machine Learning
“Using data to understand an underlying process”
[Figure: an underlying process maps x to y; a machine learning algorithm produces a model that approximates the underlying process]
[Figure: an underlying process generates {x1, x2, …}; a machine learning algorithm produces a model that approximates the underlying process]
The created model depends on the data values used for training.
[Figure: Data set 1 → machine learning algorithm → Model 1; Data set 2 → machine learning algorithm → Model 2]
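To make the dependence on training data concrete, here is a minimal sketch that runs the same algorithm on two data sets and obtains two different models. Ordinary least squares for a single input is used as the algorithm. The two data sets are invented for illustration, and `fit_line` is a hypothetical helper name.

```python
def fit_line(points):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Two made-up samples, imagined as noisy draws from the same underlying process
data_set_1 = [(0, 1.0), (1, 3.0), (2, 5.0), (3, 7.0)]
data_set_2 = [(0, 1.5), (1, 2.5), (2, 5.5), (3, 6.5)]

model_1 = fit_line(data_set_1)
model_2 = fit_line(data_set_2)

print("Model 1 (slope, intercept):", model_1)
print("Model 2 (slope, intercept):", model_2)
```

The two fitted lines are close but not identical, which is the point of the slide: the model reflects whichever sample it was trained on.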
Why build a model?
[Figure: scatter plot of observations (o) over Time, with a few points marked x]
• Predict
– A continuous value
– A category label
• Find clusters in data
• Identify key predictors
• …
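For the "find clusters in data" use case, the following is a minimal k-means sketch, assuming two-dimensional points and hand-picked starting centroids. The points are invented and `k_means` is a hypothetical helper; the slides do not prescribe a clustering method at this point.

```python
import math

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Made-up observations forming two visibly separate groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
initial_centroids = [points[0], points[3]]

final_centroids, final_clusters = k_means(points, initial_centroids)
print(final_centroids)
print(final_clusters)
```

No labels are supplied here: the grouping comes only from distances between points, which is what separates clustering from the prediction tasks in the same list.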
Why build a model? (cont.)
Training phase
– The machine learning algorithm learns from data
– Output is a trained model
– Time consuming
– Typically involves multiple iterations over training data
Testing or scoring phase
– The trained model is used in conjunction with new data inputs to estimate the corresponding output
– Much quicker compared to training
[Figure: Training data → machine learning algorithm → trained model; new data input → trained model → corresponding data output]
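The two phases can be sketched as below, assuming a one-input linear model trained by gradient descent on mean squared error. The data, learning rate, and epoch count are illustrative choices, and `train` and `score` are hypothetical names.

```python
def train(xs, ys, learning_rate=0.05, epochs=2000):
    """Training phase: many passes over the data, nudging the parameters downhill."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b  # the trained model

def score(model, x_new):
    """Scoring phase: one cheap evaluation of the trained model on a new input."""
    w, b = model
    return w * x_new + b

training_x = [0.0, 1.0, 2.0, 3.0, 4.0]
training_y = [1.0, 3.0, 5.0, 7.0, 9.0]  # made-up data following y = 2x + 1

trained_model = train(training_x, training_y)
prediction = score(trained_model, 10.0)
print(prediction)
```

Training loops over the whole data set thousands of times, while scoring is a single multiply and add, which reflects the cost difference the slide describes.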
Linear
– OLS regression
Generalized linear
– Logistic regression, GAMs
Rule based
– Decision trees
Kernel-based
– Support vector machines
White box
– Regression family, Decision tree family
Black box
– Neural networks
Parametric
– Regression family
Non-parametric
– Support vector machines, Rule based fuzzy systems
Ensemble based
– Random forest, AdaBoost
Supervised
– Decision trees, logistic regression
Unsupervised
– K-means clustering, hierarchical clustering
Generative
– Naïve Bayes, mixture of Gaussians
Discriminative
– Support vector machines, logistic regression, Decision trees
Classification
– Decision trees, logistic regression
Regression (predicting a continuous value)
– OLS regression
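As one instance from the lists above, logistic regression is both supervised and a classification method: it learns from labeled examples and predicts a category label. The sketch below assumes a single input feature and plain gradient descent on the log loss. The data set (hours studied versus pass/fail) and all function names are invented for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, labels, learning_rate=0.1, epochs=5000):
    """Fit P(label = 1 | x) = sigmoid(w * x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, labels)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, labels)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

def predict_label(model, x):
    """Return the category label (0 or 1) with the higher estimated probability."""
    w, b = model
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

# Made-up labeled examples: hours studied and whether the exam was passed
hours = [0.5, 1.0, 1.5, 2.0, 3.5, 4.0, 4.5, 5.0]
passed = [0, 0, 0, 0, 1, 1, 1, 1]

classifier = train_logistic(hours, passed)
print(predict_label(classifier, 1.0), predict_label(classifier, 4.5))
```

Replacing the 0/1 labels with real-valued targets and the sigmoid with the identity would turn this into the regression case listed last.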
Algorithm
[Figures: Linear regression, Logistic regression, Decision trees, Multi-layer perceptron, Random forest]
Sources: Wikipedia; http://what-when-how.com/face-recognition/facial-landmark-localization-face-recognition-techniques-part-1/
Ref: http://www.saedsayad.com/logistic_regression.htm
Which algorithm should I use?
• Objective of analysis
– Prediction of a continuous value
– Classification
– Identifying key predictors
• Data type and distribution
• Computational complexity of the algorithm
• Data volume