Combining Data Mining and Machine Learning for Effective User Profiling Saturday, 14 May 2016

Page 1: Combining Data Mining and Machine Learning for Effective User Profiling

Page 2: Combining Data Mining and Machine Learning for Effective User Profiling

Wealth of data/information, Lack of knowledge

Databases keep growing larger
• Terrorbytes!

A deluge of data, containing a lot of hidden information
• New knowledge

What are the technological motivations?
• Technologies to collect data
  – Bar code readers, scanners, cameras, etc.
• Technologies to store data
  – Databases, data warehouses, other repositories
• Network (Web) as computing and storage platform

An example of the data deluge:
• The Web and social media!

Page 3: Combining Data Mining and Machine Learning for Effective User Profiling

Why Mine Data? Commercial Viewpoint

Lots of data is being collected and warehoused
• Web data, e-commerce
• Purchases at department/grocery stores
• Bank/credit card transactions

Competitive pressure is strong
• Use data mining to provide better, customized services for an edge (e.g. in Customer Relationship Management)

Page 4: Combining Data Mining and Machine Learning for Effective User Profiling

Why Mine Data? Scientific Viewpoint

Data is collected and stored at enormous speeds (GB/hour)
• Remote sensors on a satellite
• Telescopes scanning the skies
• Microarrays generating gene expression data
• Scientific simulations generating terabytes of data

Traditional techniques are infeasible for raw data

Data mining may help scientists:
• In classifying and segmenting data
• In hypothesis formation

Page 5: Combining Data Mining and Machine Learning for Effective User Profiling

What is Data Mining?

Data mining (many definitions)
• Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

Data mining: a misnomer?
• It should be called pattern mining, in analogy to gold mining

Alternative names:
• Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Page 6: Combining Data Mining and Machine Learning for Effective User Profiling

Origins of Data Mining

• Draws ideas from machine learning/AI, pattern recognition, statistics, database systems, HPC
• Traditional techniques may be unsuitable due to
  1. Enormity of data
  2. High dimensionality of data
  3. Heterogeneous, distributed nature of data

[Figure: Data Mining at the intersection of Machine Learning/Pattern Recognition, Statistics/AI, Database systems, and High Performance Computing]

Page 7: Combining Data Mining and Machine Learning for Effective User Profiling

KDD is a process

– Data mining is the core of the KDD process

[Figure: the KDD process. Databases are integrated into a data warehouse; cleansing, selection and transformation produce task-relevant data; data mining extracts patterns, which are then interpreted and evaluated]
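The stages of the KDD process can be sketched as a chain of small functions. This is a minimal illustration only: the function names, the toy records, and the frequency-count "pattern" are all assumptions made for the example, not part of the original slides.

```python
from collections import Counter

def select(records, fields):
    """Selection: keep only the task-relevant fields."""
    return [{f: r.get(f) for f in fields} for r in records]

def clean(records):
    """Cleansing: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Transformation: normalize text values (trim and lowercase)."""
    return [{k: v.strip().lower() for k, v in r.items()} for r in records]

def mine(records, field):
    """Data mining: a deliberately simple 'pattern', the frequency of each value."""
    return Counter(r[field] for r in records)

def evaluate(patterns, min_support):
    """Interpretation / evaluation: keep patterns seen at least min_support times."""
    return {value: n for value, n in patterns.items() if n >= min_support}

# Hypothetical raw records standing in for an integrated database.
raw = [
    {"user": "u1", "item": " Book ", "ip": "10.0.0.1"},
    {"user": "u2", "item": "book", "ip": "10.0.0.2"},
    {"user": "u3", "item": None, "ip": "10.0.0.3"},
    {"user": "u4", "item": "Pen", "ip": "10.0.0.4"},
]

task_relevant = transform(clean(select(raw, ["user", "item"])))
patterns = evaluate(mine(task_relevant, "item"), min_support=2)
print(patterns)
```

In a real project each stage would be far more involved; the point here is only that data mining is one step in a longer chain.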

Page 8: Combining Data Mining and Machine Learning for Effective User Profiling

Data Mining: On What Kind of Data?

• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
  1. Object-oriented and object-relational databases
  2. Spatial databases
  3. Time-series data and temporal data
  4. Text databases and multimedia databases
  5. Heterogeneous and legacy databases
  6. WWW

Page 9: Combining Data Mining and Machine Learning for Effective User Profiling

Web Mining applies DM to WWW

Data mining
• Often applied to structured databases

Web mining
• Applied to less structured, dynamic data of huge size
• Covers not only Web content, but also hyperlinks and access logs
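As an illustration of mining access logs, the sketch below builds a crude per-client interest profile from a few log lines. It assumes lines in the Common Log Format; the sample lines, the regular expression, and the idea of using the first path segment as a "section" are assumptions made for this example.

```python
import re
from collections import Counter, defaultdict

# Matches: client, identity, user, [timestamp], "METHOD path protocol", status, size
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+) [^"]*" (\d{3}) \S+$'
)

# Hypothetical access-log lines.
log_lines = [
    '10.0.0.1 - - [14/May/2016:10:00:00 +0000] "GET /sports/news HTTP/1.1" 200 512',
    '10.0.0.1 - - [14/May/2016:10:01:00 +0000] "GET /sports/scores HTTP/1.1" 200 734',
    '10.0.0.2 - - [14/May/2016:10:02:00 +0000] "GET /tech/reviews HTTP/1.1" 200 1024',
    '10.0.0.1 - - [14/May/2016:10:03:00 +0000] "GET /tech/reviews HTTP/1.1" 404 0',
]

def build_profiles(lines):
    """Count successful page views per client, grouped by top-level section."""
    profiles = defaultdict(Counter)
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not follow the expected format
        client, path, status = match.groups()
        if status != "200":
            continue  # ignore failed requests
        section = path.strip("/").split("/")[0]
        profiles[client][section] += 1
    return profiles

profiles = build_profiles(log_lines)
for client, sections in profiles.items():
    print(client, dict(sections))
```

Identifying users by IP address is a simplification; real usage mining typically needs session reconstruction as well.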

Page 10: Combining Data Mining and Machine Learning for Effective User Profiling

Web Mining Hierarchy

Page 11: Combining Data Mining and Machine Learning for Effective User Profiling

Why?

Data gathered from both the web and more conventional sources can be used to answer such questions as:
• Marketing - those likely to buy.
• Forecasts - predicting demand.
• Loyalty - those likely to defect.
• Credit - which were the profitable items.
• Fraud - when and where they occur.

Page 12: Combining Data Mining and Machine Learning for Effective User Profiling

Related Terms

• Analytics: discovery and communication of meaningful patterns in data.
• Data mining: process of discovering patterns in large datasets using methods from AI, machine learning, statistics and database systems.
• Predictive analytics: techniques from statistics, machine learning and data mining used in conjunction with historical and current data to make predictions about the future.

Page 13: Combining Data Mining and Machine Learning for Effective User Profiling
Page 14: Combining Data Mining and Machine Learning for Effective User Profiling

Machine Learning

“Using data to understand an underlying process”

[Figure: an underlying process maps an input x to an output y; a machine learning algorithm uses this data to build a model that approximates the underlying process]
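A minimal sketch of this idea, assuming a hypothetical hidden process y = 2x + 1 observed with noise: the learning algorithm (ordinary least squares here) only sees the (x, y) pairs and produces a model that approximates the process. The function names and the choice of process are assumptions for illustration.

```python
import random

def underlying_process(x):
    """Hypothetical hidden process; in practice it is unknown."""
    return 2.0 * x + 1.0

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [i / 10 for i in range(100)]
ys = [underlying_process(x) + rng.gauss(0, 0.1) for x in xs]  # noisy observations

slope, intercept = fit_line(xs, ys)
print(f"model: y = {slope:.2f} * x + {intercept:.2f}")
```

The fitted slope and intercept should land close to 2 and 1, but never exactly on them: the model approximates the process, it does not recover it.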

Page 15: Combining Data Mining and Machine Learning for Effective User Profiling

“Using data to understand an underlying process”

[Figure: an underlying process generates data {x1, x2, …}; a machine learning algorithm uses this data to build a model that approximates the underlying process]
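When only observations {x1, x2, …} are available, with no paired outputs, a model can still describe the process that generated them. The sketch below assumes, purely for illustration, that the hidden process is a Gaussian with mean 5 and standard deviation 2, and estimates those two parameters from the samples.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable
# Observations from a hypothetical hidden process.
observations = [rng.gauss(5.0, 2.0) for _ in range(1000)]

def fit_gaussian(values):
    """Model the data as a single Gaussian: estimate its mean and standard deviation."""
    return statistics.mean(values), statistics.stdev(values)

mean, std = fit_gaussian(observations)
print(f"model: Gaussian(mean={mean:.2f}, std={std:.2f})")
```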

Page 16: Combining Data Mining and Machine Learning for Effective User Profiling

The created model depends on the data values used for training.

[Figure: the same machine learning algorithm applied to Data set 1 yields Model 1, and applied to Data set 2 yields Model 2]
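To make this concrete, the sketch below trains one very simple algorithm on two hypothetical data sets. The algorithm (a one-dimensional threshold placed midway between the two class means) and the data values are assumptions chosen for brevity.

```python
def train_threshold(samples):
    """Learn a 1-D threshold: the midpoint between the two class means.

    samples is a list of (value, label) pairs with labels 0 and 1.
    """
    class0 = [v for v, label in samples if label == 0]
    class1 = [v for v, label in samples if label == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2

def predict(threshold, value):
    """Assign label 1 to values above the learned threshold, else 0."""
    return 1 if value > threshold else 0

data_set_1 = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]
data_set_2 = [(2.0, 0), (3.0, 0), (4.0, 0), (10.0, 1), (11.0, 1), (12.0, 1)]

model_1 = train_threshold(data_set_1)
model_2 = train_threshold(data_set_2)

print(model_1, model_2)
print(predict(model_1, 6.0), predict(model_2, 6.0))
```

The same input (6.0) is labelled differently by the two models, even though the algorithm is identical.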

Page 17: Combining Data Mining and Machine Learning for Effective User Profiling

Why build a model?

[Figure: scatter plot of data points (o) with a few marked points (x), plotted against Time]

• Predict
  – A continuous value
  – A category label
• Find clusters in data
• Identify key predictors
• …

Page 22: Combining Data Mining and Machine Learning for Effective User Profiling

Training phase
– The machine learning algorithm learns from data
– Output is a trained model
– Time consuming
– Typically involves multiple iterations over training data

Testing or scoring phase
– The trained model is used in conjunction with new data inputs to estimate the corresponding output
– Much quicker than training

[Figure: training data → machine learning algorithm → trained model; new data input → trained model → corresponding data output]
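The two phases can be sketched with a tiny logistic regression trained by stochastic gradient descent. The data, learning rate, and number of epochs are assumptions chosen for illustration; the point is that training loops over the data many times, while scoring is a single evaluation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, learning_rate=0.1, epochs=1000):
    """Training phase: many passes over the training data."""
    weight, bias = 0.0, 0.0
    for _ in range(epochs):
        for x, y in samples:
            error = sigmoid(weight * x + bias) - y  # gradient of the log loss
            weight -= learning_rate * error * x
            bias -= learning_rate * error
    return weight, bias

def score(model, x):
    """Scoring phase: one cheap evaluation of the trained model."""
    weight, bias = model
    return sigmoid(weight * x + bias)

# Hypothetical training data: (input, label) pairs.
training_data = [(0.0, 0), (1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1), (5.0, 1)]
model = train(training_data)

for new_input in (0.5, 4.5):
    print(new_input, round(score(model, new_input), 3))
```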

Page 23: Combining Data Mining and Machine Learning for Effective User Profiling

Algorithm

Linear
– OLS regression

Generalized linear
– Logistic regression, GAMs

Rule based
– Decision trees

Kernel-based
– Support vector machines

White box
– Regression family, decision tree family

Black box
– Neural networks

Parametric
– Regression family

Non-parametric
– Support vector machines, rule-based fuzzy systems

Ensemble based
– Random forest, AdaBoost

Supervised
– Decision trees, logistic regression

Unsupervised
– K-means clustering, hierarchical clustering

Generative
– Naïve Bayes, mixture of Gaussians

Discriminative
– Support vector machines, logistic regression, decision trees

Classification
– Decision trees, logistic regression

Regression (predicting a continuous value)
– OLS regression
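Of the families listed, the unsupervised one is perhaps the least intuitive, since there are no labels to learn from. The sketch below is a minimal one-dimensional K-means with hypothetical data and hand-picked starting centers; practical implementations handle many dimensions and choose initial centers more carefully.

```python
def k_means_1d(values, centers, iterations=10):
    """Alternate between assigning points to the nearest center and
    moving each center to the mean of its assigned points."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

values = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]  # hypothetical unlabeled data
centers, clusters = k_means_1d(values, centers=[1.0, 9.0])
print(centers)
print(clusters)
```

No labels are supplied anywhere: the two groups emerge from the distances between the values alone.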

Page 24: Combining Data Mining and Machine Learning for Effective User Profiling

[Figures: illustrations of linear regression, logistic regression, decision trees, a multi-layer perceptron, and a random forest]

Sources:
• http://what-when-how.com/face-recognition/facial-landmark-localization-face-recognition-techniques-part-1/
• http://www.saedsayad.com/logistic_regression.htm
• Wikipedia

Page 25: Combining Data Mining and Machine Learning for Effective User Profiling

Which algorithm should I use?

• Objective of analysis
  – Prediction of a continuous value
  – Classification
  – Identifying key predictors
• Data type and distribution
• Computational complexity of the algorithm
• Data volume

Page 26: Combining Data Mining and Machine Learning for Effective User Profiling