Extended Compact Genetic Algorithms and Learning Classifier Systems for Dimensionality Reduction: a Protein Alphabet Reduction Study Case

Ben-Gurion University of the Negev - Distinguished Scientist Visitor Program - Beer Sheva, Israel - June 23rd to July 5th 2009

Page 1

Jaume Bacardit & Natalio Krasnogor

ASAP - Interdisciplinary Optimisation Laboratory, School of Computer Science

Centre for Integrative Systems Biology, School of Biology

Centre for Healthcare Associated Infections, Institute of Infection, Immunity & Inflammation

University of Nottingham

Page 2

Acknowledgements (in no particular order)

Peter Siepmann Pawel Widera James Smaldon Azhar Ali Shah Jack Chaplin Enrico Glaab German Terrazas Hongqing Cao Jamie Twycross Jonathan Blake Francisco Romero-Campero Maria Franco Adam Sweetman Linda Fiaschi


School of Physics and Astronomy School of Chemistry School of Pharmacy School of Biosciences School of Mathematics School of Computer Science Centre for Biomolecular Sciences all the above at UoN

Funding from: BBSRC, EPSRC, EU, ESF, UoN

(The above are the contributors to the talks I will give at BGU)

Thanks also go to:

Ben Gurion University of the Negev’s Distinguished Scientists Visitor Program

Professor Dr. Moshe Sipper

Page 3

Outline

Introduction to Learning Classifier Systems and Extended Compact GA

Problem Definition
Methods (ECGA, LCS, Mutual Information)
Results
Conclusions and further work

Page 4

Based on Various Papers

J. Bacardit, M. Stout, J.D. Hirst, A. Valencia, R.E. Smith, and N. Krasnogor. Automated alphabet reduction for protein datasets. BMC Bioinformatics, 10(6), 2009.

J. Bacardit, M. Stout, J.D. Hirst, K. Sastry, X. Llora, and N. Krasnogor. Automated alphabet reduction method with evolutionary algorithms for protein structure prediction. In Proceedings of the 2007 Genetic and Evolutionary Computation Conference, pages 346-353. ACM Press, 2007. ISBN 978-1-59593-697-4. This paper won the Bronze Medal in the 2007 "Humies" Awards for Human-Competitive Results Produced by Genetic and Evolutionary Computation.

J. Bacardit and N. Krasnogor. Performance and efficiency of memetic Pittsburgh learning classifier systems. Evolutionary Computation, 17(3), 2009 (to appear).

J. Bacardit, E.K. Burke, and N. Krasnogor. Improving the scalability of rule-based evolutionary learning. Memetic Computing, 1(1), 2009 (to appear).

J. Bacardit, M. Stout, and N. Krasnogor. A tale of human-competitiveness in bioinformatics. SIGEvolution: Newsletter of the ACM Special Interest Group on Genetic and Evolutionary Computation, 3(1):2-10, 2008.

All papers available from: www.cs.nott.ac.uk/~nxk/publications.html

Page 5

Learning Classifier Systems (LCS) are one of the major families of techniques that apply evolutionary computation to machine learning tasks

Machine learning: how to construct programs that automatically improve with experience [Mitchell, 1997]

Classification task: learning how to correctly label new instances from a domain based on a set of previously labeled instances

LCS are almost as old as GAs; Holland made one of the first proposals
Two of the first LCS proposals are [Holland & Reitman, 78] and [Smith, 80]

Page 6

Traditionally there have been two different paradigms of LCS:
The Pittsburgh approach [Smith, 80]
The Michigan approach [Holland & Reitman, 78]
More recently: the Iterative Rule Learning approach [Venturini, 93]

Knowledge representations:
All the initial approaches were rule-based
In recent years several knowledge representations have been used in the LCS field: decision trees, synthetic prototypes, etc.

Page 7

Classification task

Learning how to correctly label new instances from a domain based on a set of previously labeled instances

[Diagram: the Training Set feeds a Learning Algorithm, which produces an Inference Engine; a New Instance is given to the Inference Engine, which outputs its Class]

Page 8

Classification task

[Figure: points in the unit square, axes X and Y ranging over 0 to 1]

If (X<0.25 and Y>0.75) or (X>0.75 and Y<0.25) then class = 1

Page 9

Paradigms of LCS: The Pittsburgh approach

Each individual is a complete solution to the classification problem
Traditionally this means that each individual is a variable-length set of rules
GABIL [De Jong & Spears, 93] is a well-known representative of this approach
The fitness function is based on the rule set's accuracy on the training set (and usually also on its complexity)

Page 10

Paradigms of LCS: The Pittsburgh approach

Crossover operator: [diagram of two parent rule sets exchanging rule segments to produce offspring]
Mutation operator: bit flipping

Individuals are interpreted as a decision list: an ordered rule set (rules 1 to 8 in the slide's example):

Instance 1 matches rules 2, 3 and 7 → rule 2 will be used
Instance 2 matches rules 1 and 8 → rule 1 will be used
Instance 3 matches rule 8 → rule 8 will be used
Instance 4 matches no rules → instance 4 will not be classified
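A minimal Python sketch of this decision-list inference; the (predicate, class) rule encoding is illustrative, not GABIL's actual genome:

    def classify(decision_list, instance, default=None):
        # Rules are checked in order; the first matching rule decides the class
        for predicate, label in decision_list:
            if predicate(instance):
                return label
        return default  # no rule matches: the instance is left unclassified

    # Illustrative rules over the earlier X/Y example
    rules = [
        (lambda p: p["X"] < 0.25 and p["Y"] > 0.75, 1),
        (lambda p: p["X"] > 0.75 and p["Y"] < 0.25, 1),
        (lambda p: True, 0),  # catch-all rule at the end of the list
    ]
    print(classify(rules, {"X": 0.1, "Y": 0.9}))  # -> 1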

Page 11

Paradigms of LCS: The Michigan approach

Each individual is a single rule
The whole population cooperates to solve the classification problem
A reinforcement system is used to identify the good rules
A GA is used to explore the search space for more rules
XCS [Wilson, 95] is the most well-known Michigan LCS

Page 12

Paradigms of LCS: Working cycle

[Diagram of the LCS working cycle]

Page 13

Paradigms of LCS: The Iterative Rule Learning approach

Each individual is a single rule
Individuals compete as in a standard GA
A single GA run generates one rule
The GA is run iteratively to learn all the rules that solve the problem
Instances already covered by previous rules are removed from the training set of the next iteration

Page 14

Paradigms of LCS: The Iterative Rule Learning approach

HIDER system [Aguilar, Riquelme & Toro, 03]:

1. Input: Examples
2. RuleSet = Ø
3. While |Examples| > 0
   3.1. Rule = run GA with Examples
   3.2. RuleSet = RuleSet ∪ {Rule}
   3.3. Examples = Examples \ Covered(Rule)
4. EndWhile
5. Output: RuleSet
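The same loop as a runnable Python sketch; run_ga(examples) and rule.covers(example) are hypothetical stand-ins for the GA run and the coverage test, not HIDER's or BioHEL's actual API:

    def iterative_rule_learning(examples, run_ga):
        rule_set = []
        while examples:
            rule = run_ga(examples)            # one GA run generates one rule
            rule_set.append(rule)
            remaining = [e for e in examples if not rule.covers(e)]
            if len(remaining) == len(examples):
                break                          # no example covered: stop early
            examples = remaining               # separate-and-conquer step
        return rule_set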

Page 15

Bioinformatics-oriented Hierarchical Evolutionary Learning (BioHEL)

BioHEL [Bacardit et al., 07] is a recent learning system that applies the Iterative Rule Learning (IRL) approach to generate sets of rules

IRL was first used in EC by the SIA system [Venturini, 93]

BioHEL is strongly inspired by GAssist [Bacardit, 04], a Pittsburgh approach Learning Classifier System

Page 16

BioHEL learning paradigm

IRL has been used for many years in the ML community under the name of separate-and-conquer

Page 17

BioHEL’s objective function

An objective function based on the Minimum Description Length (MDL) principle [Rissanen, 1978] that tries to promote rules with:

High accuracy: not making mistakes
High coverage: covering as many examples as possible without sacrificing accuracy; recall (TP/(TP+FN)) will be used to define coverage
Low complexity: rules as simple and general as possible

The objective function is a linear combination of the three objectives above
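A hedged sketch of such a fitness, combining the three objectives linearly; the weights and the complexity measure (number of attributes expressed by the rule) are illustrative assumptions, not BioHEL's exact MDL-based formula:

    def rule_fitness(tp, fp, fn, n_expressed_attrs,
                     w_acc=1.0, w_cov=0.5, w_cmp=0.1):
        accuracy = tp / (tp + fp) if tp + fp else 0.0  # high accuracy
        coverage = tp / (tp + fn) if tp + fn else 0.0  # recall as coverage
        # simpler rules (fewer expressed attributes) are penalized less
        return w_acc * accuracy + w_cov * coverage - w_cmp * n_expressed_attrs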

Page 18

BioHEL’s objective function

Intuitively, we would like to have accurate rules covering as many examples as possible
However, in complex and inconsistent domains it is rare to obtain such rules
In these cases, the easier path for the evolutionary search is to maximize accuracy at the expense of coverage
Therefore, we need to enforce that the evolved rules cover enough examples

Page 19

Methods: BioHEL’s objective function

Three parameters define the shape of the function
The choice of the coverage break is crucial for the proper performance of the system
Also, the coverage term penalizes rules that do not cover a minimum percentage of examples or that cover too many

Page 20

BioHEL’s other characteristics

Attribute list rule representation: automatically identifies the relevant attributes for a given rule and discards all the others
The ILAS windowing scheme: an efficiency enhancement method; not all training points are used for each fitness computation
An explicit default rule mechanism, generating more compact rule sets: the iterative process terminates when it is impossible to evolve a rule whose associated class is the majority class among the matched examples; at this point, all remaining training instances are assigned to the default class
Ensembles for consensus prediction: an easy way of boosting robustness

Page 21

Knowledge representations

Representation of XCS for binary problems: the ternary representation
Ternary alphabet {0,1,#}: "If A1=0 and A2=1 and A3 is irrelevant then class 0" is encoded as 01#|0

Representations of XCS for real-valued attributes: real-valued intervals
XCSR [Wilson, 99]: the interval is codified with two variables, center and spread, as [center-spread, center+spread]
UBR [Stone & Bull, 03]: the two bounds of the interval are codified directly with two real-valued variables; the variable with the lowest value is the lower bound, the variable with the highest value is the upper bound
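The two interval encodings translate to a couple of lines each; a sketch with illustrative helper names, not code from XCSR or UBR:

    def xcsr_match(center, spread, value):
        # XCSR: the interval is [center - spread, center + spread]
        return center - spread <= value <= center + spread

    def ubr_match(bound_a, bound_b, value):
        # UBR: two unordered bounds; the smaller one acts as the lower bound
        return min(bound_a, bound_b) <= value <= max(bound_a, bound_b)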

Page 22

Knowledge representations

Representation of GABIL for nominal attributes: Predicate → Class
The predicate is in Conjunctive Normal Form (CNF):
(A1=V1^1 ∨ … ∨ A1=V1^n) ∧ … ∧ (An=Vn^1 ∨ … ∨ An=Vn^m)
where Ai is the ith attribute and Vi^j is the jth value of the ith attribute

The rules can be mapped into a binary string, e.g., for 3 attributes with {3,5,2} values each respectively:
(A1=V1^1 ∨ A1=V1^3) ∧ (A2=V2^2 ∨ A2=V2^4 ∨ A2=V2^5) ∧ (A3=V3^1) → 101|01011|10
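A sketch of how such a bitstring is matched, following the slide's 3-attribute example: one bit per attribute value, and a rule matches an instance if, for every attribute, the bit of the instance's value is set.

    def gabil_match(rule_bits, instance, sizes):
        # rule_bits: concatenated per-attribute masks, e.g. '101'+'01011'+'10'
        # instance: tuple of 0-based value indices, one per attribute
        offset = 0
        for value, size in zip(instance, sizes):
            if rule_bits[offset + value] != '1':
                return False
            offset += size
        return True

    # The slide's rule 101|01011|10 matches (A1=V1^1, A2=V2^2, A3=V3^1):
    print(gabil_match("1010101110", (0, 1, 0), (3, 5, 2)))  # -> True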

Page 23

Knowledge representations

Pittsburgh representations for real-valued attributes:
Rule-based: the Adaptive Discretization Intervals (ADI) representation [Bacardit, 04]
Intervals in ADI are built using as possible bounds the cut-points proposed by a discretization algorithm
The search bias promotes maximally general intervals
Several discretization algorithms are used at the same time to reduce bias

Page 24

Knowledge representations

Pittsburgh representations for real-valued attributes: decision trees [Llorà, 02]
Nodes in the trees can use orthogonal or oblique criteria

Page 25

Knowledge representations

Pittsburgh representations for real-valued attributes: synthetic prototypes [Llorà, 02]
Each individual is a set of synthetic instances
These instances are used as the core of a nearest-neighbor classifier

Page 26

Extended Compact Genetic Algorithm (ECGA)

ECGA belongs to a class of evolutionary algorithms called Estimation of Distribution Algorithms (EDA):
no crossover or mutation!
instead, a probabilistic model of the structure of the problem is kept
individuals are sampled from this probability distribution model

Page 27

Key Idea Behind Compact GA (CGA)


Taken from: Linkage Learning via Probabilistic Modeling in the Extended Compact Genetic Algorithm (ECGA) by Georges R. Harik, Fernando G. Lobo and Kumara Sastry. Studies in Computational Intelligence, Volume 33/2006, Springer, 2007.

Page 28

Gene interactions must be accounted for
ECGA approximates complex distributions by marginal distribution models (i.e., gene partitions)
It selects amongst alternative models by means of the MDL criterion:
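The criterion itself was a figure on the original slide; in the standard ECGA formulation (reconstructed here from the chapter cited below, so treat the exact form as an assumption), a partition of the genes into groups of sizes S_1, …, S_g is scored, for a population of size N, by the combined complexity

    C_{combined} = C_{model} + C_{population}
    C_{model} = \log_2(N+1) \sum_{i=1}^{g} \left(2^{S_i} - 1\right)
    C_{population} = N \sum_{i=1}^{g} H(M_i), \qquad
    H(M_i) = -\sum_{j} p_{ij} \log_2 p_{ij}

where p_{ij} is the observed frequency of the j-th configuration of group i; the model with the lowest combined complexity is kept.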


Taken from: Linkage Learning via Probabilistic Modeling in the Extended Compact Genetic Algorithm (ECGA) by Georges R. Harik, Fernando G. Lobo and Kumara Sastry. Studies in Computational Intelligence, Volume 33/2006, Springer, 2007.

Page 29

Taken from: Linkage Learning via Probabilistic Modeling in the Extended Compact Genetic Algorithm (ECGA) by Georges R. Harik, Fernando G. Lobo and Kumara Sastry. Studies in Computational Intelligence, Volume 33/2006, Springer, 2007.

Page 30

Outline

Introduction to Learning Classifier Systems and Extended Compact GA

Problem Definition
Methods (ECGA, LCS, Mutual Information)
Results
Conclusions and further work

Page 31

Protein Structure Prediction (PSP) has the goal of predicting the 3D structure of a protein based on its primary sequence

[Figure: primary sequence → 3D structure]

Page 32

PSP is a very costly process
As an example, one of the best PSP methods in the last CASP meeting, Rosetta@Home, used up to 10^4 computing years to predict a single protein’s 3D structure

Ways to alleviate the computational burden:
simplify the problem
simplify the representation used to model the proteins

Page 33

From Full PSP to CN prediction

Two residues of a chain are said to be in contact if their distance is less than a certain threshold
The CN of a residue is the number of contacts that that residue has
In this specific case we predict, e.g., whether the CN of a residue is smaller or larger than the mid-point of the CN domain
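A sketch of how CN could be computed from per-residue coordinates; the 8 Å threshold and the minimum chain separation are illustrative assumptions, not the exact definition used in the dataset:

    import numpy as np

    def contact_numbers(coords, threshold=8.0, min_seq_sep=2):
        # coords: (n_residues, 3) array of per-residue positions
        dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
        n = len(coords)
        cn = np.zeros(n, dtype=int)
        for i in range(n):
            for j in range(n):
                # ignore trivial contacts between immediate chain neighbours
                if abs(i - j) >= min_seq_sep and dist[i, j] < threshold:
                    cn[i] += 1
        return cn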

[Figure: a contact between two residues, shown on the primary sequence and in the native state]

Page 34

From Full PSP to SA prediction

Solvent Accessibility (SA): the amount of the surface of each residue that is exposed to the solvent (e.g. water)
The metric is normalised for each AA type
The problem is to predict whether the SA is lower or higher than 25%

Page 35

(Recap of Page 32) PSP is a very costly process; the ways to alleviate the computational burden are to simplify the problem or to simplify the representation used to model the proteins

Page 36

The primary sequence of a protein (the amino acid type of the elements of a protein chain) is a usual target for such simplification
It is composed of a quite high-cardinality alphabet of 20 symbols
One example of reduction widely used in the community is the hydrophobic-polar (HP) alphabet, which reduces these 20 symbols to just two
The HP representation is usually too simple: information is lost in the reduction process

M. Stout et al. Prediction of residue exposure and contact number for simplified HP lattice model proteins using learning classifier systems. In Proceedings of the 7th International FLINS Conference on Applied Artificial Intelligence, pages 601-608. World Scientific, August 2006.

M. Stout, J. Bacardit, J.D. Hirst, N. Krasnogor, and J. Blazewicz. From HP lattice models to real proteins: coordination number prediction using learning classifier systems. In 4th European Workshop on Evolutionary Computation and Machine Learning in Bioinformatics, volume 3907 of Lecture Notes in Computer Science, pages 208-220. Springer, Budapest, Hungary, April 2006.

Papers at: http://www.cs.nott.ac.uk/~nxk/publications.html

Page 37

Research Questions

Are there “simplified” alphabets that retain key information content while simplifying interpretation, processing time, etc.?

If yes, are these alphabets general for any problem domain, or domain-specific?

Can we automatically generate these alphabets and tailor them to the specific domain we are predicting?

Page 38

Outline

Introduction to Learning Classifier Systems and Extended Compact GA

Problem Definition
Methods (ECGA, LCS, Mutual Information)
Results
Conclusions and further work

Page 39

Use an (automated) information-theory-driven pipeline to reduce the alphabet of PSP datasets

Use the Extended Compact Genetic Algorithm (ECGA) to find a dimensionality reduction policy, guided by a fitness function based on the Mutual Information (MI) metric

Two PSP datasets will be used as testbeds:
Coordination Number (CN) prediction
Relative Solvent Accessibility (SA) prediction

Verify the optimized reduction policies with BioHEL, an evolutionary-computation-based rule learning system

J. Bacardit, M. Stout, J.D. Hirst, A. Valencia, R.E. Smith, and N. Krasnogor. Automated alphabet reduction for protein datasets. BMC Bioinformatics, 10(6), 2009.
J. Bacardit, M. Stout, and N. Krasnogor. A tale of human-competitiveness in bioinformatics. SIGEvolution: Newsletter of the ACM Special Interest Group on Genetic and Evolutionary Computation, 3(1):2-10, 2008.
J. Bacardit, M. Stout, J.D. Hirst, K. Sastry, X. Llora, and N. Krasnogor. Automated alphabet reduction method with evolutionary algorithms for protein structure prediction. In Proceedings of the 2007 Genetic and Evolutionary Computation Conference, pages 346-353. ACM Press, 2007.

All papers at: http://www.cs.nott.ac.uk/~nxk/publications.html

Page 40

Protein dataset proposed by [Kinjo et al., 05]: 1050 proteins, 259,768 residues

Proteins were selected from PDB-REPRDB using these conditions:
Less than 30% sequence identity
More than 50 residues
Resolution better than 2 Å
No membrane proteins, no chain breaks, no non-standard residues
Crystallographic R-factor better than 20%

The dataset is partitioned into training/test sets using ten-fold cross-validation

Page 41

Instance Representation

[Figure: a chain segment with residues AAi-5 … AAi+5 and their contact numbers CNi-5 … CNi+5]

Each instance is a window of amino acid types centred on a target residue, and the class is that residue's CN. For example, with a window of ±1 residue:

AAi-1, AAi, AAi+1 → CNi
AAi, AAi+1, AAi+2 → CNi+1
AAi+1, AAi+2, AAi+3 → CNi+2
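A sketch of this windowed instance generation, assuming per-chain lists of amino acid symbols and CN labels and using 'X' as the end-of-chain padding symbol (the pipeline's 21st symbol); the window half-size is illustrative:

    def make_instances(sequence, cn_labels, w=1, pad='X'):
        padded = [pad] * w + list(sequence) + [pad] * w
        instances = []
        for i, label in enumerate(cn_labels):
            window = padded[i:i + 2 * w + 1]  # AAi-w ... AAi+w
            instances.append((window, label))
        return instances

    # make_instances("ACDE", [0, 1, 1, 0], w=1)
    # -> [(['X','A','C'], 0), (['A','C','D'], 1), ...]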

Page 42

Taken from: J. Bacardit, M. Stout, J.D. Hirst, K. Sastry, X. Llora, and N. Krasnogor. Automated alphabet reduction method with evolutionary algorithms for protein structure prediction. In Proceedings of the 2007 Genetic and Evolutionary Computation Conference, pages 346-353. ACM Press, 2007.

Page 43

General Workflow of the Alphabet Reduction Pipeline

[Diagram: Dataset (|Σ|=20) → ECGA, guided by Mutual Information and a target alphabet size N → Dataset (|Σ|=N) → BioHEL → ensemble of rule sets → evaluated on the test set → accuracy]

Page 44

Methods: alphabet reduction strategies

Three strategies were evaluated; they represent progressive levels of sophistication:
Mutual Information (MI)
Robust Mutual Information (RMI)
Dual Robust Mutual Information (DualRMI)

Thus MI, RMI and DualRMI were used in separate experiments as the “fitness” function for the ECGA tournament phase

Page 45

Methods: MI strategy

There are 21 symbols (20 AA + end of chain) in the alphabet
Each symbol will be assigned to a group in the chromosome used by ECGA

Page 46

Methods: MI strategy

Objective function for the MI strategy: Mutual Information
Mutual Information is a measure that quantifies the interrelationship between two discrete variables

X is the reduced representation of the window of residues around the target
Y is the two-state definition of CN or SA
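The MI formula itself was an image on the original slide; for two discrete variables it is the standard definition (in LaTeX notation):

    I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log_2 \frac{p(x,y)}{p(x)\,p(y)}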

Page 47

Methods: MI strategy

Steps of the objective function computation for the MI strategy:
1. Reduction mappings are extracted from the chromosome
2. Instances of the training set are transformed into the lower-cardinality alphabet
3. The mutual information between the class attribute and the string formed by concatenating the input attributes is computed
4. This MI is assigned as the result of the evaluation function
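A sketch of these four steps, assuming instances are (window, label) pairs and `mapping` sends each of the 21 symbols to its group index; the names are illustrative:

    from collections import Counter
    from math import log2

    def mi_fitness(instances, mapping):
        # Steps 1-2: transform each window into the reduced alphabet and
        # concatenate it into a single string (the X variable)
        pairs = [("".join(str(mapping[aa]) for aa in window), label)
                 for window, label in instances]
        # Step 3: mutual information between X and the class attribute Y
        n = len(pairs)
        pxy = Counter(pairs)
        px = Counter(x for x, _ in pairs)
        py = Counter(y for _, y in pairs)
        return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
                   for (x, y), c in pxy.items())  # Step 4: MI is the fitness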

Page 48

Methods: MI strategy

Problem of the MI strategy:
Mutual Information needs redundancy in order to be a good estimator
That is, each possible pattern in X and Y should be well represented in the dataset
Patterns in Y are always well represented. What happens with patterns in X in our dataset?
Our sample, despite having almost 260,000 residues, is too small:

#letters   Represented patterns
2          100%
3          97.8%
4          57.6%
5          11.3%
20         3.1E-07

Page 49

Methods: RMI strategy

In order to solve the sample-size problem of the MI strategy, we use a robust MI estimator proposed by [Cline et al., 02]:
Pairs of (x,y) in the dataset are scrambled; that is, each x in the dataset is randomly joined to a y, but the distributions of x and y remain equal
MI is computed for the scrambled dataset
This process is repeated N times, and the average scrambled MI (MIs) is computed

Finally, the value of the objective function is MI − MIs
MIs is an estimate of the sampling bias in the data; by subtracting it from the original MI metric we obtain a less biased objective function
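A sketch of this robust estimate, repeating the MI computation from the previous sketch so the snippet is self-contained; the number of scrambles is illustrative:

    import random
    from collections import Counter
    from math import log2

    def mi(pairs):
        n = len(pairs)
        pxy = Counter(pairs)
        px = Counter(x for x, _ in pairs)
        py = Counter(y for _, y in pairs)
        return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
                   for (x, y), c in pxy.items())

    def robust_mi(xs, ys, n_scrambles=10, rng=random.Random(0)):
        base = mi(list(zip(xs, ys)))
        total = 0.0
        for _ in range(n_scrambles):
            ys_perm = list(ys)
            rng.shuffle(ys_perm)  # scrambles the pairing, keeps both marginals
            total += mi(list(zip(xs, ys_perm)))
        return base - total / n_scrambles  # MI - MIs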

Page 50

Methods: DualRMI strategy

The next strategy is based on some observations we made in previous work [Bacardit et al., 06]
[Figure: example of a rule set for predicting CN from the primary sequence]
The predicate associated with the target residue (AA) is very different from the predicates associated with the other window positions

Page 51

Methods: DualRMI strategy

Why not generate two reduced alphabets at the same time?
One for the target residue
One for the other residues in the window
The objective function remains unchanged

Page 52

Outline

Introduction to Learning Classifier Systems and Extended Compact GA

Problem Definition
Methods (ECGA, LCS, Mutual Information)
Results
Conclusions and further work

Page 53

Experimental design

For each problem (CN, SA) and each reduction strategy (MI, RMI, DualRMI), ECGA was run to generate alphabets of two, three, four and five letters
Afterwards, BioHEL was trained over the reduced datasets to determine the prediction accuracy that could be obtained from each alphabet size
Comparisons are drawn

Page 54

Reduced alphabets for CN

[Figure; amino acids that always remain in the same group are marked with solid rectangles]

Page 55

Alphabets for CN

The two-letter alphabet divides the amino acids between hydrophobic and polar
RMI could not find a five-letter alphabet; DualRMI did, but only for the target residue
RMI and DualRMI have a much larger number of framed residues, showing more robustness
For DualRMI we can observe small groups of hydrophobic residues, while all the polar ones are in the same group
We can also observe a strange group, GHTS, that mixes different kinds of physico-chemical properties; it is not explained by those properties but by the inherent distribution of the datasets

Page 56

A retrospective analysis of the dataset reveals why GHTS are clustered together

We computed the proportion of residues for each amino acid type with high CN

These four residues have very similar average behavior in relation to CN

Page 57

Accuracy of CN prediction using BioHEL

The accuracy difference between the AA representation and the best reduced alphabets is 0.7%
The difference is non-significant according to t-tests
RMI and DualRMI perform similarly

Page 58

Reduced alphabets for SA

Page 59

Reduced alphabets for SA

Even though SA and CN are somewhat related structural features, the resulting alphabets are different
These alphabets contain more groups of polar residues, and fewer groups of hydrophobic ones (in contrast with CN)
In DualRMI with 5 letters we can observe very small groups: A and EK for the target-residue alphabet; G and X for the other-residues alphabet
Again, the GHTS group appears

Page 60

Analysis of average SA behavior for each AA type

The reduced alphabet perfectly matched the properties of the SA features

Page 61

Accuracy of SA prediction with BioHEL

[Table: accuracy of reduced alphabets for SA prediction]
Only DualRMI managed to give a performance statistically similar to the original AA representation
The accuracy difference is 0.4%

Page 62

Comparison to Other Reduced Alphabets from the Literature and Expert-Designed Alphabets Based on Physico-Chemical Properties

Alphabet   Letters   CN acc.     SA acc.     Diff. (CN/SA)   Ref.
AA         20        74.0±0.6    70.7±0.4    ---             ---
DualRMI    5         73.3±0.5    70.3±0.4    0.7/0.4         This work
WW5        6         73.1±0.7    69.6±0.4    0.9/1.1         [Wang & Wang, 99]
SR5        6         73.1±0.7    69.6±0.4    0.9/1.1         [Solis & Rackovsky, 00]
MU4        5         72.6±0.7    69.4±0.4    1.4/1.3         [Murphy et al., 00]
MM5        6         73.1±0.6    69.3±0.3    0.9/1.4         [Melo & Marti-Renom, 06]
HD1        7         72.9±0.6    69.3±0.4    1.1/1.4         This work
HD2        9         73.0±0.6    69.3±0.4    1.0/1.4         This work
HD3        11        73.2±0.6    69.9±0.4    0.8/0.8         This work

(WW5, SR5, MU4 and MM5 are alphabets from the literature; HD1-HD3 are expert-designed alphabets)

Page 63

Reduced Alphabets Comparison

Automatically reduced alphabets obtain better accuracy, but how different are the alphabets themselves?
We applied again the AA-wise high-CN/SA analysis
Two metrics were computed (see the sketch after this list):
Transitions: how many times the group index changes through the list of sorted AAs; the fewer the changes, the more homogeneous the groups are
Average range: the range of a reduction group is the difference between the minimum and maximum CN/SA of the AAs belonging to that group; the smaller the average range, the more focused the reduction groups are with respect to that structural property
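A sketch of both metrics; `values[aa]` is each amino acid's average high-CN/SA proportion and `group[aa]` its reduction group, both illustrative inputs:

    def transitions(values, group):
        order = sorted(values, key=values.get)  # AAs sorted by CN/SA value
        return sum(1 for a, b in zip(order, order[1:]) if group[a] != group[b])

    def average_range(values, group):
        by_group = {}
        for aa, g in group.items():
            by_group.setdefault(g, []).append(values[aa])
        ranges = [max(v) - min(v) for v in by_group.values()]
        return sum(ranges) / len(ranges)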

Page 64

Reduced Alphabets Comparison (CN)

Page 65

Reduced Alphabets Comparison (SA)

Page 66

Additional Results

Are the alphabets interchangeable across problems?

Can these reduced alphabets be applied to an evolutionary information-based representation?

Page 67

Results: Are the alphabets interchangeable?

We applied the alphabet optimized for CN to SA, and vice versa
The SA alphabet is good for predicting CN, but the CN alphabet obtains poor performance on SA
Reduced alphabets must always be tailored to the domain at hand

Page 68

Results: application of the reduced alphabets to an evolutionary information-based representation

So far we have used only the simple primary-sequence representation
Can this process be applied to much richer (and more complex) representations?
We computed the position-specific scoring matrix (PSSM) representation of our dataset using PSI-BLAST; each instance (9 window positions) is represented by 180 continuous variables (rather than 20+1 as originally done)
Then, we reduced this representation using our alphabets: the values of each PSSM profile corresponding to amino acids in the same reduction group are averaged
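A sketch of this PSSM column-averaging with NumPy; the group assignment is illustrative:

    import numpy as np

    def reduce_pssm(pssm, groups):
        # pssm: (window_positions, 20) array of PSI-BLAST scores
        # groups: 20 group indices, one per amino acid column
        groups = np.asarray(groups)
        reduced = np.zeros((pssm.shape[0], groups.max() + 1))
        for g in range(groups.max() + 1):
            reduced[:, g] = pssm[:, groups == g].mean(axis=1)  # per-group mean
        return reduced  # e.g. 9 x 5 = 45 attributes instead of 9 x 20 = 180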

Page 69

Results

Application of reduced alphabets to a PSSM representation

Thus, we reduced the representation from 180 attributes to 45

Page 70

Results: learning from the reduced PSSM representation

The accuracy difference is still less than 1%
The obtained rule sets are simpler and the training process is much faster
Performance levels are similar to recent works in the literature [Kinjo et al., 05][Dor and Zhou, 07]

Page 71

Conclusions

We have proposed an automated alphabet reduction protocol for protein datasets
The protocol does not use any domain knowledge; it automatically tailors the reduced alphabets to the domain at hand
Our experiments show that it is possible to obtain quite reduced alphabets (5 letters) with performance similar to the original AA alphabet
Our reduced alphabets are better at CN and SA prediction than other alphabets from the literature, as they are better suited to these tasks
The findings from the protocol can be used in state-of-the-art protein representations such as PSSM profiles
We found some unexpected reduction groups (GHTS), but the properties of the data showed that this is not an artifact

Page 72

Future work

Explore alternative objective evaluation functions: other robust MI estimations
Explore slightly higher-cardinality alphabets: is it possible to close the accuracy gap even more?
Apply this protocol to other kinds of datasets, e.g. protein mutations
Consider structural aspects defined as continuous variables, not just discrete ones

Page 73

Acknowledgements (in no particular order)

(Same contributors, schools and funding bodies as listed on Page 2)

Thanks also go to:
Ben Gurion University of the Negev’s Distinguished Scientists Visitor Program
Professor Dr. Moshe Sipper