Combining Data Mining and Machine Learning for Effective User Profiling

Saturday, 14 May 2016

Wealth of data/information, Lack of knowledge

Databases keep growing larger
• Terrorbytes!

A deluge of data, containing a lot of hidden information
• New knowledge

What are the technological motivations?
• Technologies to collect data
– Bar code readers, scanners, cameras, etc.
• Technologies to store data
– Databases, data warehouses, other repositories
• Network (Web) as computing and storage platform

An example of the data deluge:
• The Web and social media!

Why Mine Data? Commercial Viewpoint

Lots of data is being collected and warehoused
• Web data, e-commerce
• Purchases at department/grocery stores
• Bank/credit card transactions

Competitive pressure is strong
• Use data mining to provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Mine Data? Scientific Viewpoint

Data collected and stored at enormous speeds (GB/hour)
• Remote sensors on a satellite
• Telescopes scanning the skies
• Microarrays generating gene expression data
• Scientific simulations generating terabytes of data

Traditional techniques infeasible for raw data

Data mining may help scientists:
• in classifying and segmenting data
• in hypothesis formation

What is Data Mining?

Data mining (many definitions)
• Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
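One concrete reading of "discover meaningful patterns" is counting which items tend to appear together in transactions. The following is a minimal sketch under that assumption; the baskets, the `frequent_pairs` helper, and the `min_support` parameter are invented for illustration and do not come from the slides.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return item pairs that co-occur in at least `min_support` transactions."""
    counts = Counter()
    for basket in transactions:
        # Deduplicate and sort so each pair is counted once per basket
        # and always appears in the same order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical grocery baskets, invented for illustration.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["beer", "bread"],
    ["butter", "milk"],
]
patterns = frequent_pairs(baskets, min_support=2)
```

Here `patterns` keeps only the pairs seen in two or more baskets; raising `min_support` trades coverage for stronger evidence.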

Data mining: a misnomer?
• It should be pattern mining, in analogy to gold mining

Alternative names:
• Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition, statistics, database systems, HPC
• Traditional techniques may be unsuitable due to
1. Enormity of data
2. High dimensionality of data
3. Heterogeneous, distributed nature of data

[Figure: data mining at the intersection of machine learning/pattern recognition, statistics/AI, database systems, and high performance computing]

KDD is a process
– Data mining is the core of the KDD process

[Figure: the KDD process, from databases through data integration, the data warehouse, data selection (cleansing / selection / transformation), task-relevant data, and data mining to pattern interpretation / evaluation]
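The stages named on the KDD slide can be read as a chain of small functions. This is only a sketch under stated assumptions: the records, field names, and the function names (`integrate`, `clean`, `select_and_transform`, `mine`, `evaluate`) are hypothetical, and the "mining" step is reduced to a simple count.

```python
from collections import Counter

# Hypothetical records from two sources, invented for illustration.
web_orders = [{"user": "ann", "item": "book", "price": "12.5"},
              {"user": "bob", "item": "pen", "price": None}]
store_orders = [{"user": "ann", "item": "book", "price": "11.0"},
                {"user": "cy", "item": "book", "price": "13.0"}]


def integrate(*sources):
    """Data integration: merge records from several databases."""
    return [record for source in sources for record in source]


def clean(records):
    """Cleansing: drop records with a missing price."""
    return [r for r in records if r["price"] is not None]


def select_and_transform(records):
    """Selection/transformation: keep task-relevant fields, parse prices."""
    return [(r["item"], float(r["price"])) for r in records]


def mine(rows):
    """Data mining: count how often each item was bought."""
    return Counter(item for item, _ in rows)


def evaluate(patterns, min_count):
    """Interpretation/evaluation: keep only patterns seen often enough."""
    return {item: n for item, n in patterns.items() if n >= min_count}


task_relevant = select_and_transform(clean(integrate(web_orders, store_orders)))
knowledge = evaluate(mine(task_relevant), min_count=2)
```

The mining step is one line here, which reflects the slide's point in reverse: most of the code surrounds the mining step, even though mining is the core of the process.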

Data Mining: On What Kind of Data?
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
1. Object-oriented and object-relational databases
2. Spatial databases
3. Time-series data and temporal data
4. Text databases and multimedia databases
5. Heterogeneous and legacy databases
6. WWW

Web Mining applies DM to WWW

Data Mining
• Often applied to structured databases

Web mining
• Applied to less structured data, dynamic, of huge size
• Not only Web content, but also hyperlinks and access logs
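Access logs are one of the Web mining sources mentioned above, and they are a natural starting point for user profiling. The sketch below assumes log lines in Common Log Format and treats the client host as the "user"; both are assumptions, as are the sample lines and the `build_profiles` helper.

```python
import re
from collections import Counter, defaultdict

# Pattern for lines in Common Log Format (assumed here; real logs vary).
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def build_profiles(lines):
    """Map each host to a Counter of the pages it successfully requested."""
    profiles = defaultdict(Counter)
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not parse
        if match.group("status") != "200":
            continue  # ignore failed requests
        profiles[match.group("host")][match.group("path")] += 1
    return profiles


# Hypothetical log lines, invented for illustration.
log = [
    '192.0.2.1 - - [14/May/2016:10:00:00 +0000] "GET /shoes HTTP/1.1" 200 512',
    '192.0.2.1 - - [14/May/2016:10:01:00 +0000] "GET /shoes HTTP/1.1" 200 512',
    '192.0.2.1 - - [14/May/2016:10:02:00 +0000] "GET /hats HTTP/1.1" 404 0',
    '192.0.2.7 - - [14/May/2016:10:03:00 +0000] "GET /books HTTP/1.1" 200 2048',
    'not a log line',
]
profiles = build_profiles(log)
```

Each profile is just a page-visit count per host; a real profile would likely combine this with content and hyperlink information.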

Web Mining Hierarchy

Why?
Data gathered from both the web and more conventional sources can be used to answer such questions as:
• Marketing - those likely to buy.
• Forecasts - predicting demand.
• Loyalty - those likely to defect.
• Credit - which were the profitable items.
• Fraud - when and where they occur.

Related Terms
• Analytics: discovery and communication of meaningful patterns in data.
• Data mining: process of discovering patterns in large datasets using methods from AI, machine learning, statistics and database systems.
• Predictive analytics: techniques from statistics, machine learning and data mining used in conjunction with historical and current data to make predictions about the future.

Machine Learning

“Using data to understand an underlying process”

[Figure: an underlying process maps inputs x to outputs y; a machine learning algorithm produces a model that approximates the underlying process]

[Figure: an underlying process generates observations {x1, x2, …}; a machine learning algorithm produces a model that approximates the underlying process]
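To make "a model that approximates the underlying process" concrete, the sketch below invents a process (a straight line plus noise), samples it, and recovers an approximation with ordinary least squares. The process, its coefficients, the noise level, and the function names are all assumptions chosen for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def underlying_process(x, rng):
    """Hypothetical process: the line y = 2x + 1 plus Gaussian noise."""
    return 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)


rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [i / 10 for i in range(100)]
ys = [underlying_process(x, rng) for x in xs]
slope, intercept = fit_line(xs, ys)
```

The fitted `slope` and `intercept` should land close to 2 and 1, but not exactly on them: the model only sees noisy data, never the process itself.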

[Figure: the same machine learning algorithm applied to data set 1 yields model 1; applied to data set 2 it yields model 2]

The created model depends on the data values used for training.
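The dependence of the model on its training data can be shown by running one algorithm on two data sets. This is a small sketch with invented points; least squares stands in for "the machine learning algorithm", and `fit_line` is an illustrative name.

```python
def fit_line(points):
    """Least-squares fit of y = slope * x + intercept to (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Two hypothetical data sets, invented for illustration.
data_set_1 = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
data_set_2 = [(0.0, 2.0), (1.0, 2.5), (2.0, 3.0)]

model_1 = fit_line(data_set_1)  # (slope, intercept) learned from data set 1
model_2 = fit_line(data_set_2)  # same algorithm, different data
```

The algorithm is identical in both calls; only the data differ, and so do the resulting models.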

Why build a model?

[Figure: scatter plot of data points over time]

• Predict
– A continuous value
– A category label
• Find clusters in data
• Identify key predictors
• …
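The "find clusters in data" goal can be sketched with k-means on one-dimensional values. This is a minimal sketch, assuming invented data and hand-picked starting centers; `k_means_1d` is an illustrative name, not something from the slides.

```python
def k_means_1d(values, centers, iterations=10):
    """Run k-means (Lloyd's algorithm) on scalars from given initial centers.

    Returns the final centers and the cluster assignment computed in the
    last iteration.
    """
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Update step: move each center to its cluster mean (keep it if empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical measurements forming two visible groups.
values = [1.0, 1.5, 2.0, 8.0, 9.0, 10.0]
centers, clusters = k_means_1d(values, centers=[0.0, 5.0])
```

No labels are supplied: the two groups emerge from the values alone, which is what separates clustering from predicting a category label.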

Why build a model (cont..)

[Figure: the same scatter plot of data points over time, repeated across successive slides]

• Predict
– A continuous value
– A category label

Training phase
– The machine learning algorithm learns from data
– Output is a trained model
– Time consuming
– Typically involves multiple iterations over training data

Testing or scoring phase
– The trained model is used in conjunction with new data inputs to estimate corresponding output
– Much quicker as compared to training

[Figure: training data → machine learning algorithm → trained model; new data input → trained model → corresponding data output]
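The split between the two phases can be sketched with a one-feature logistic regression trained by gradient descent. The data, the `epochs` and `learning_rate` values, and the `train` and `score` names are assumptions for illustration; any iterative learner would show the same contrast.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=2000, learning_rate=0.1):
    """Training phase: fit weight and bias by gradient descent on log loss."""
    weight, bias = 0.0, 0.0
    for _ in range(epochs):  # multiple iterations over the training data
        grad_w = grad_b = 0.0
        for x, label in examples:
            error = sigmoid(weight * x + bias) - label
            grad_w += error * x
            grad_b += error
        weight -= learning_rate * grad_w / len(examples)
        bias -= learning_rate * grad_b / len(examples)
    return weight, bias  # the trained model


def score(model, x):
    """Scoring phase: a single cheap evaluation of the trained model."""
    weight, bias = model
    return sigmoid(weight * x + bias)


# Hypothetical training data: (feature value, category label).
training_data = [(-3.0, 0), (-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1), (3.0, 1)]
model = train(training_data)
probability = score(model, 2.5)  # estimated probability of label 1 for new input
```

`train` loops over the data thousands of times, while `score` does one multiplication and one exponential, which is why scoring is much quicker than training.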

Linear
– OLS regression

Generalized linear
– Logistic regression, GAMs

Rule based
– Decision trees

Kernel-based
– Support vector machines

White box
– Regression family, decision tree family

Black box
– Neural networks

Parametric
– Regression family

Non-parametric
– Support vector machines, rule-based fuzzy systems

Ensemble based
– Random forest, AdaBoost

Supervised
– Decision trees, logistic regression

Unsupervised
– K-means clustering, hierarchical clustering

Generative
– Naïve Bayes, mixture of Gaussians

Discriminative
– Support vector machines, logistic regression, decision trees

Classification
– Decision trees, logistic regression

Regression (predicting a continuous value)
– OLS regression
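As one illustration of the rule-based, white-box, supervised corner of this taxonomy, the sketch below learns a decision stump, that is, a decision tree with a single split. The data and the `fit_stump` and `predict` names are invented for illustration; real decision tree learners choose splits with impurity measures rather than raw error counts.

```python
def fit_stump(examples):
    """Learn the rule 'predict 1 if x > threshold' with the fewest errors.

    Assumes at least two distinct x values among the (x, label) examples.
    """
    xs = sorted({x for x, _ in examples})
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]

    def errors(threshold):
        return sum((x > threshold) != bool(label) for x, label in examples)

    return min(candidates, key=errors)


def predict(threshold, x):
    """Apply the learned rule to a new input."""
    return 1 if x > threshold else 0


# Hypothetical labeled data: (feature value, category label).
examples = [(1, 0), (2, 0), (3, 0), (6, 1), (7, 1), (9, 1)]
threshold = fit_stump(examples)
```

The learned model is a single readable rule, which is what "white box" means in practice: the threshold can be inspected and questioned directly.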

Algorithm

[Figures: linear regression, logistic regression, decision trees, multi-layer perceptron, random forest]

Sources:
Wikipedia
http://what-when-how.com/face-recognition/facial-landmark-localization-face-recognition-techniques-part-1/
http://www.saedsayad.com/logistic_regression.htm

Which algorithm should I use ?

• Objective of analysis
– Prediction of a continuous value
– Classification
– Identifying key predictors
• Data type and distribution
• Computational complexity of the algorithm
• Data volume
