Patrick Nicolas http://patricknicolas.blogspot.com 07/13/2013

Data Modeling using Symbolic Regression

DESCRIPTION

This is an introduction to the concept of symbolic regression for effectively managing data streams. Symbolic regression combines genetic algorithms, reinforcement learning and flexible policies to extract meaning or knowledge from data in an ever-changing environment. Because the knowledge extracted from real-time data is human-readable, decision makers can validate the findings of the algorithm and act appropriately. Symbolic regression is used in signal processing, process monitoring and adaptive caching in data centers.

Page 1: Data Modeling using Symbolic Regression

Patrick Nicolas http://patricknicolas.blogspot.com

07/13/2013

Page 2: Data Modeling using Symbolic Regression

Need for reliability

Copyright 2013 Patrick Nicolas 2

Existing algorithms used in recommendation, prediction of consumer behavior or targeted advertising do not have to be very accurate: the negative impact of recommending the wrong book or movie, or of failing to detect a consumer's interest, is very limited.

However, some problems require a far more reliable solution: failing to preserve large amounts of data, detect a security intrusion or predict the progress of a disease has grave consequences.

Page 3: Data Modeling using Symbolic Regression

Options

Traditional data mining approaches such as clustering (unsupervised learning) and generative or discriminative supervised learning algorithms fail to capture the evolutionary nature of a system, with its states and underlying data.

Page 4: Data Modeling using Symbolic Regression

Supervised learning

Supervised learning is effective for problems with a large training set compared to the dimension of the model. However, it suffers from the following limitations:

• Over-fitting: a supervised learning algorithm needs a large training set to compensate for bias in the training data

• No descriptive (human-readable) knowledge representation

• The role of the domain expert is limited to providing labeled data and validating the results

• The model has to be retrained in case of false positives or false negatives

Page 5: Data Modeling using Symbolic Regression

Unsupervised learning

Unsupervised learning methods such as Spectral Clustering or Kernel-based K-Means are used for anomaly detection or dimension reduction, but have drawbacks:

• Poor classification when discrete and continuous variables are mixed

• No descriptive knowledge representation

• Limited leverage of domain expertise: the role of the domain expert is limited to validating the clusters

• Clusters have to be rebuilt if the number of outliers increases

Page 6: Data Modeling using Symbolic Regression

Symbolic Regression

Symbolic Regression addresses the key limitations of unsupervised and supervised learning methods. It combines evolutionary computation with reinforcement learning to provide domain experts with a tool to create, evaluate and modify rules, policies or models.

The most commonly used algorithms in Symbolic Regression are:

• Genetic programming

• Learning Classifier Systems

Page 7: Data Modeling using Symbolic Regression

Symbolic Regression

Symbolic Regression is used in very different applications such as:

• Optimization of data archiving

• Intelligent data and instrumentation streaming

• Predicting the behavior of an e-commerce site during “flash” or holiday sales

• Monitoring and predicting security vulnerabilities in data centers

• Distribution of network traffic and flow in a public cloud

Page 8: Data Modeling using Symbolic Regression

Symbolic representation

The goal is to extract knowledge from data (numerical, textual, events…) as a symbolic or human-readable representation using primitives or operators:

• Boolean operators: OR, AND, XOR, …

• Numerical functions: Sin, Exp, Sigmoid, …

• Numerical operators: +, *, o (composition), …

• Differential operators: derivative, integral, …

• Logical operators: predicates, rules, …

[Figure: a domain expert and data mining turn data into symbolic primitives such as sin, exp, _ * _, "If _ then _" and "_ has a _".]

Page 9: Data Modeling using Symbolic Regression

Knowledge Extraction

Knowledge extraction is the process of selecting and combining the appropriate symbolic primitives or operators to describe and predict the states of a system.

[Figure: expertise and a model f built from symbolic primitives (sin, exp, _ * _, "If _ then _", "_ has a _") relate the system's state/data to a prediction.]
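As a rough illustration of combining primitives into a model, the sketch below evaluates an expression tree built from a small primitive set. The `PRIMITIVES` table, the nested-tuple tree layout and the `evaluate` helper are assumptions made for this example, not part of the original deck.

```python
import math

# Hypothetical primitive set; the names mirror the primitives listed in the slides.
PRIMITIVES = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "sin": math.sin,
    "exp": math.exp,
    "sqr": lambda a: a * a,
}

def evaluate(node, x):
    """Evaluate a nested-tuple expression tree at the point x."""
    if node == "x":                      # the input variable
        return x
    if isinstance(node, (int, float)):   # a numeric constant
        return node
    op, *args = node                     # (operator, operand, ...)
    return PRIMITIVES[op](*(evaluate(arg, x) for arg in args))

# 2*sin(x) - exp(x*x), written as a combination of primitives
tree = ("-", ("*", 2, ("sin", "x")), ("exp", ("sqr", "x")))
print(evaluate(tree, 0.5))
```

A model in this form stays readable by a domain expert, who can inspect or edit the tree directly.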

Page 10: Data Modeling using Symbolic Regression

Knowledge Primitives

The generation of knowledge from a set of symbolic primitives to represent the underlying state of a system is an NP problem (combinatorial explosion). Moreover, computers process data in binary format (information theory).

[Figure: a value and its binary encoding.]

The solution is to represent knowledge as symbolic primitives in binary format.

Page 11: Data Modeling using Symbolic Regression

Knowledge Encoding

The most common representation encodes symbolic primitives as sequences of 0s and 1s.

f(x) = 2*sin(x) - exp(x*x)

Expression tree: - ( * (sin, 2), o (exp, sqr))

Flattened symbols: - * o sin 2 exp sqr

Symbols stored as: long long long

Binary data: 0101001001110111011101110111011101111111000111111011101101000001001000101010
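The encoding steps can be sketched as follows: flatten the expression tree level by level, then map each symbol to a fixed-width bit pattern. The 4-bit width and the `SYMBOLS` alphabet are assumptions chosen for readability; the slide suggests one long per symbol, and its exact bit layout is not specified.

```python
from collections import deque

def level_order(tree):
    """Flatten a nested-tuple tree breadth-first: root, its children, and so on."""
    symbols, queue = [], deque([tree])
    while queue:
        node = queue.popleft()
        if isinstance(node, tuple):
            symbols.append(node[0])   # operator
            queue.extend(node[1:])    # operands are visited later
        else:
            symbols.append(node)      # leaf symbol
    return symbols

SYMBOLS = ["-", "*", "o", "sin", "2", "exp", "sqr"]  # hypothetical alphabet
WIDTH = 4                                            # bits per symbol (assumption)
CODE = {s: format(i, f"0{WIDTH}b") for i, s in enumerate(SYMBOLS)}
DECODE = {bits: s for s, bits in CODE.items()}

def encode(symbols):
    """Concatenate the fixed-width code of each symbol."""
    return "".join(CODE[s] for s in symbols)

def decode(bits):
    """Split the bit string into WIDTH-sized chunks and look each one up."""
    return [DECODE[bits[i:i + WIDTH]] for i in range(0, len(bits), WIDTH)]

tree = ("-", ("*", "sin", "2"), ("o", "exp", "sqr"))
flat = level_order(tree)      # same symbol order as the flattened form above
chromosome = encode(flat)
print(flat, chromosome)
```

A fixed width keeps decoding trivial, at the cost of unused bit patterns that a genetic operator may produce and that a full implementation would have to repair or reject.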

Page 12: Data Modeling using Symbolic Regression

Data Modeling using Genetic Algorithm

For a given state of a system, we need to find the optimal model (combination of primitives) that describes the current state, using a Genetic Algorithm. The (0,1) encoding is treated as a chromosome subject to selection, cross-over, transposition and mutation operators.

[Figure: example bit-string chromosomes illustrating cross-over (parents and offspring), mutation and transposition.]
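The three operators on bit-string chromosomes can be sketched as below. This is a minimal illustration under assumptions: single-point cross-over, independent bit-flip mutation, and transposition read as moving a segment to another position. The deck does not specify which variants it uses, and the function names are hypothetical.

```python
import random

def crossover(parent_a, parent_b, point):
    """Single-point cross-over: swap the tails of two equal-length bit strings."""
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < rate else bit
        for bit in chromosome
    )

def transpose(chromosome, start, length, target):
    """Cut `length` bits at `start` and re-insert them at index `target`
    of the remaining string."""
    segment = chromosome[start:start + length]
    rest = chromosome[:start] + chromosome[start + length:]
    return rest[:target] + segment + rest[target:]

rng = random.Random(1)  # seeded so runs are repeatable
child_a, child_b = crossover("11110000", "00001111", point=3)
print(child_a, child_b)
print(mutate("11110000", rate=0.1, rng=rng))
print(transpose("11000000", start=0, length=2, target=6))
```

Cross-over and transposition only rearrange existing bits, while mutation is the operator that introduces new ones.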

Page 13: Data Modeling using Symbolic Regression

Computation Flow of Genetic Algorithm

[Figure: genetic algorithm flow. Initial pool of models, encoding, initial chromosomes, then a loop of fitness evaluation, selection, cross-over, mutation and transposition that yields a new population; the fittest chromosome is decoded into the best model.]

Once the initial set of chromosomes is randomly generated, the algorithm iterates until the fittest chromosome emerges.
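The iteration can be sketched as a small generational loop. Everything here is an assumption for illustration: truncation selection, elitism, single-point cross-over, bit-flip mutation, a fixed number of generations as the stopping rule, and a toy fitness (`matches`) that counts bits agreeing with a hypothetical `TARGET`. Transposition is omitted to keep the sketch short.

```python
import random

def evolve(fitness, n_bits=16, pop_size=30, generations=40, rate=0.02, seed=0):
    """Minimal generational GA over chromosomes stored as lists of 0/1.
    Returns the fittest chromosome of the final population."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:pop_size // 2]       # truncation selection
        children = [ranked[0][:]]              # elitism: carry over the best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            point = rng.randrange(1, n_bits)   # single-point cross-over
            child = a[:point] + b[point:]
            # bit-flip mutation
            child = [bit ^ 1 if rng.random() < rate else bit for bit in child]
            children.append(child)
        population = children
    return max(population, key=fitness)

TARGET = [1, 0] * 8   # hypothetical "ideal" encoding, used only by the toy fitness

def matches(chromosome):
    """Toy fitness: number of bits that agree with TARGET."""
    return sum(c == t for c, t in zip(chromosome, TARGET))

best = evolve(matches)
print(best, matches(best))
```

With elitism the best fitness never decreases from one generation to the next, which makes the loop easy to reason about; a real model search would replace `matches` with a measure of how well the decoded model describes the observed state.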

Page 14: Data Modeling using Symbolic Regression

Limitation of Genetic Algorithm

The selection of the best chromosome representing the best classifier (or model) relies on the computation of a fitness value under the assumption that the objective does not change over time.

As most systems evolve over time, so does the objective. Reinforcement learning is used to adjust the objective using a reward/credit assignment mechanism.

Page 15: Data Modeling using Symbolic Regression

Concept of Reinforcement Learning

As the state of the system evolves over time, the system rewards or punishes the fittest classifier, whose action has been executed. The reward or punishment is used to adjust the objective and the fitness function.

[Figure: reinforcement learning loop linking the system (state/data, probes, effectors, reward), reward assignment, the genetic algorithm over encoded primitives (best classifier) and decoding into the best action.]
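One simple way to picture reward assignment is to move the fitness of the executed classifier toward the reward it received. The update rule below (an exponential moving average with a hypothetical learning rate `beta`) and the classifier names are assumptions for illustration; the deck does not state which credit-assignment rule it uses.

```python
def assign_reward(fitness, executed, reward, beta=0.2):
    """Return a copy of `fitness` in which the executed classifier's value
    has moved toward the observed reward. beta is a hypothetical
    learning rate in (0, 1]."""
    updated = dict(fitness)
    updated[executed] += beta * (reward - updated[executed])
    return updated

# Two hypothetical classifiers starting with the same fitness
fitness = {"throttle": 0.5, "archive": 0.5}
fitness = assign_reward(fitness, "archive", reward=1.0)     # rewarded
fitness = assign_reward(fitness, "throttle", reward=-1.0)   # punished
best = max(fitness, key=fitness.get)
print(fitness, best)
```

Because the fitness keeps tracking recent rewards, a classifier that was once the fittest loses rank when the system drifts and its action stops paying off.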

Page 16: Data Modeling using Symbolic Regression

Elements of Reinforcement Learning

The main challenge of reinforcement learning is to predict the impact of each action A_n on the global state. We need:

1. Actions (or classifiers) that support logical (IF/THEN), numerical (y = f(x1, …, xn)) and discrete ({ai}) forms, to predict the impact of a remedial action on the security of the system

2. A metric to measure the security of the overall system (distance between the current state and the baseline)

3. An action discovery and adaptation mechanism

4. An efficient optimizer to select the best action in any state: Stochastic Gradient Descent for continuous variables {xi} only, or a Genetic Algorithm for a mix of Boolean, Integer and Double variables
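The metric and the action selection can be illustrated with a small sketch. It assumes the state is a numeric vector, uses Euclidean distance to the baseline as the metric, and selects actions by exhaustive comparison rather than by the optimizers named in the list; the `actions` table and its entries are hypothetical.

```python
import math

def distance(state, baseline):
    """Euclidean distance between two equal-length state vectors."""
    return math.sqrt(sum((s - b) ** 2 for s, b in zip(state, baseline)))

def best_action(state, baseline, actions):
    """Pick the action whose predicted next state is closest to the baseline.
    `actions` maps a name to a function: state -> predicted state."""
    return min(actions, key=lambda name: distance(actions[name](state), baseline))

baseline = [0.0, 0.0]
state = [3.0, 4.0]
actions = {
    "noop": lambda s: s,                      # leaves the state unchanged
    "halve": lambda s: [v / 2 for v in s],    # hypothetical remedial action
}
print(distance(state, baseline), best_action(state, baseline, actions))
```

Exhaustive comparison only works for a handful of actions; with large or mixed-type action spaces this is where an optimizer such as a genetic algorithm would take over.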

Page 17: Data Modeling using Symbolic Regression

Putting All Together

[Figure: overall architecture combining a genetic algorithm (encoding of initial knowledge and expert supervised learning, classifiers population; select, cross-over, mutate, transpose) with reinforcement learning (environment state/data, probes, match, best classifiers, actions predictor, action, effectors, reward, reward assignment via Q-Learning).]
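The diagram names Q-Learning for reward assignment. As a reminder of how that rule works, here is the textbook tabular update in a minimal form; the state and action labels, and the values of `alpha` and `gamma`, are hypothetical, and the deck does not describe how its own implementation represents states.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])

q = defaultdict(float)           # unseen (state, action) pairs start at 0.0
actions = ["match", "skip"]      # hypothetical action labels
q_update(q, "s0", "match", reward=1.0, next_state="s1", actions=actions)
q_update(q, "s1", "skip", reward=2.0, next_state="s0", actions=actions)
print(dict(q))
```

The discount factor `gamma` controls how much future reward counts, so an action is credited not only for its immediate effect but also for the states it leads to.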

Page 18: Data Modeling using Symbolic Regression

References

• Genetic Programming: On the Programming of Computers by Means of Natural Selection - J. Koza

• Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning) - R. Sutton, A. Barto

• http://www.mendeley.com/catalog/symbolic-regression-via-genetic-programming/