Cognitive data analysis

Nikolay Zagoruiko

Institute of Mathematics of the Siberian Division of the Russian Academy of Sciences,

Pr. Koptyug 4, 630090 Novosibirsk, Russia, [email protected]

Areas of interest: Data Analysis, Pattern Recognition, Empirical Prediction, Discovery of Regularities, Data Mining, Machine Learning, Knowledge Discovery, Intelligent Data Analysis

Cognitive Calculations

Human-centered approach:

• The person as an object of study: their cognitive mechanisms
• The person as the subject who uses the results of the analysis

New strategic tasks cannot be solved without a rapid rise in the intellectual level of the tools for observation, analysis and control. But the complexity of these tools and the character of the results they produce make the results hard to understand. Under these conditions the person is effectively excluded from the man-machine control system.

Specifics of DM tasks:

• Large volumes of data
• Attributes of mixed types
• Number of attributes >> number of objects
• Presence of noise and missing values
• No information about distributions and dependences

Ontology of DM

The abundance of methods results from the absence of a uniform approach to solving tasks of different types

What can we learn from the person?

What decision rules does the person use? (1967)

[Figure: Recognition]

[Figure: Taxonomy]

1. A person understands the result if the classes are divided by perpendicular planes

[Figure: three panels in the (x, y) plane; labels: a, X = 0.8Y - 3, rotated axes X', Y']

2. A person understands the result if the classes are described by standards

[Figure: three panels in the (x, y) plane with standards marked by *; rotated axes X', Y']

The unique human ability to recognize hard-to-distinguish patterns is based on the ability to select informative features.

Does the person switch from one basis to another when solving different classification tasks?

Most likely, people use some universal psycho-physiological function

Our hypothesis:

The basic function a person uses in classification, recognition, feature selection, etc. is a measure of similarity

Functions of Similarity

$$1)\quad FS(a, b) = 1 - \Big(\sum_{i=1}^{n} (x_i^a - x_i^b)^2\Big)^{1/2}$$

$$2)\quad FS(a, b) = 1 - \sum_{i=1}^{n} |x_i^a - x_i^b|$$

$$3)\quad FS(a, b) = 1 - \max_i |x_i^a - x_i^b|$$

$$4)\quad FS(a, b) = \frac{\sum_{i=1}^{n} \min(x_i^a, x_i^b)}{\sum_{i=1}^{n} \max(x_i^a, x_i^b)}$$

$$5)\quad FS(a, b) = 1 - e^{\ldots},\ \ldots$$

Similarity is not an absolute but a relative category

Is object b close to a, or is it distant?

[Figure: objects a and b shown alone, then together with a third object c]

We need to know the answer to the question: in competition with what?

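Function of Competitive (Rival) Similarity (FRiS)

[Figure: object z between objects A and B at distances r1 and r2; the value of F runs from -1 to +1]

$$F(z, 1|2) = \frac{r_2 - r_1}{r_2 + r_1}$$

Here r1 is the distance from z to object 1 and r2 is the distance from z to its competitor, object 2.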
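The relative nature of this measure can be illustrated with a minimal Python sketch. It assumes Euclidean distance and a hypothetical helper name `fris`; the points are made up for illustration. The same pair (b, a) gets a different similarity depending on the competitor.

```python
import math

def fris(z, own, rival):
    """Similarity of z to `own` in competition with `rival`: (r2 - r1) / (r2 + r1)."""
    r1 = math.dist(z, own)    # distance to the object we compare with
    r2 = math.dist(z, rival)  # distance to the competitor
    if r1 + r2 == 0:
        return 0.0  # assumption: z coincides with both objects, treat as undecided
    return (r2 - r1) / (r2 + r1)

a, b = (0.0, 0.0), (1.0, 0.0)
far_c, near_c = (4.0, 0.0), (1.5, 0.0)

print(fris(b, a, far_c))   # 0.5: with a distant competitor, b looks similar to a
print(fris(b, a, near_c))  # negative: with a close competitor, b looks unlike a
```

The value is +1 when z coincides with the object it is compared with and -1 when it coincides with the competitor.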

All pattern recognition methods are based on the hypothesis of compactness (Braverman E.M., 1962)

The patterns are compact if:
- the number of boundary points is small in comparison with the total number of points;
- the patterns are separated from each other by borders that are not too elaborate.

Compactness

Similarity between objects of the same pattern should be maximal

Similarity between objects of different patterns should be minimal

[Figure: patterns A and B; object j of A at distance r1 from object i of A and at distance r2 from object b of B]

Defensive capacity: compact patterns should satisfy the condition of maximal similarity between objects of the same pattern

$$D_i = \frac{1}{M_A} \sum_{j=1}^{M_A} F(j, i \mid b), \qquad F(j, i \mid b) = \frac{r_2 - r_1}{r_2 + r_1}$$

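Tolerance: compact patterns should satisfy the condition of maximal difference of these objects from the objects of other patterns

[Figure: patterns A and B; objects i and j of A, objects q, s and b of B, distances r1 and r2]

$$T_i = \frac{1}{M_B} \sum_{q=1}^{M_B} F(q, s \mid i), \qquad F(q, s \mid i) = \frac{r_2 - r_1}{r_2 + r_1}$$

Compactness

$$C_i = (D_i + T_i) / 2$$

$$C_A = \frac{1}{M_A} \sum_{i=1}^{M_A} C_i, \qquad C_B = \frac{1}{M_B} \sum_{q=1}^{M_B} C_q, \qquad C = C_A \cdot C_B$$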

Selection of the standards (stolps)

Algorithm FRiS-Stolp

$$C_i = (D_i + T_i) / 2 \to \max$$
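The selection rule can be sketched in Python under stated assumptions: Euclidean distance, the nearest object of the other pattern as the competitor b, and the nearest own neighbour as s. The function names and sample points are hypothetical, and this covers only the scoring of candidates in one pattern, not the full FRiS-Stolp procedure.

```python
import math

def fris(r1, r2):
    """FRiS value from two distances: r1 to the own reference, r2 to the rival."""
    total = r1 + r2
    return (r2 - r1) / total if total > 0 else 0.0  # assumption: 0 when both are 0

def nearest_in(p, points):
    """Distance from p to the nearest point of another pattern."""
    return min(math.dist(p, q) for q in points)

def nearest_other(k, points):
    """Distance from points[k] to its nearest neighbour inside the same pattern."""
    return min(math.dist(points[k], p) for m, p in enumerate(points) if m != k)

def stolp_scores(A, B):
    """C_i = (D_i + T_i) / 2 for every candidate stolp i of pattern A."""
    scores = []
    for i in A:
        # Defensive capacity: own objects j should be closer to i than to pattern B.
        D = sum(fris(math.dist(j, i), nearest_in(j, B)) for j in A) / len(A)
        # Tolerance: objects q of B should be closer to their own neighbour s than to i.
        T = sum(fris(nearest_other(k, B), math.dist(q, i))
                for k, q in enumerate(B)) / len(B)
        scores.append((D + T) / 2)
    return scores

A = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
B = [(6.0, 0.0), (7.0, 0.0)]
scores = stolp_scores(A, B)
best = max(range(len(A)), key=scores.__getitem__)
print(A[best])  # (1.0, 0.0): the central object of A scores highest
```

In this toy example the central object wins because it defends both of its neighbours well while staying far enough from pattern B.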

Value of FRiS for points on a plane

Informativeness by Fisher for normal distribution

$$I_F = \frac{|\mu_1 - \mu_2|}{\sigma_1^2 + \sigma_2^2}$$

Compactness has the same meaning and can be used as a criterion of informativeness that is invariant to the distribution law and to the ratio N : M

Selection of features

[Diagram: the initial set of features Xo = {1, 2, 3, …, j, …, N} is passed to the engine (GRAD), which proposes a variant of subset X<1,2,…,n>; the criterion (FRiS-compactness) rates each variant as good or bad]

Algorithm GRAD

It is based on a combination of two greedy algorithms: forward and backward search.

The forward stage uses the Addition algorithm (J.L. Barabash, 1963).

The backward stage uses the Deletion algorithm (Merill T. and Green O.M., 1963).

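Number of subsets tested by Deletion (L_D) and by Addition (L_A) when n of N features are selected:

$$L_D = N + (N-1) + (N-2) + \dots + (n+1) = \sum_{j=0}^{N-n-1} (N - j)$$

$$L_A = N + (N-1) + (N-2) + \dots + (N-n+1) = \sum_{j=0}^{n-1} (N - j)$$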

Algorithm AdDel

To reduce the influence of accumulated errors, a relaxation method is applied:
n1 - the number of most informative attributes added to the subsystem (Add),
n2 < n1 - the number of least informative attributes eliminated from the subsystem (Del).

AdDel relaxation method: n steps forward, n/2 steps back
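The Add/Del relaxation can be sketched as follows. This is a minimal illustration, not the author's implementation: the function name `addel`, the stopping rule (stop once the subset reaches `size + n2` or the features run out, then trim to `size`) and the toy criterion are all assumptions.

```python
def addel(features, criterion, size, n1=2, n2=1):
    """Greedy AdDel sketch: add n1 best features, delete n2 worst, repeat."""
    if not 0 < n2 < n1:
        raise ValueError("need 0 < n2 < n1")
    selected = []

    def add_one():
        rest = [f for f in features if f not in selected]
        if rest:
            # Add the feature that gives the best criterion value for the subset.
            selected.append(max(rest, key=lambda f: criterion(selected + [f])))
        return bool(rest)

    def delete_one():
        # Delete the feature whose removal leaves the best remaining subset.
        worst = max(selected,
                    key=lambda f: criterion([g for g in selected if g != f]))
        selected.remove(worst)

    while True:
        exhausted = not all(add_one() for _ in range(n1))
        if exhausted or len(selected) >= size + n2:
            while len(selected) > size:
                delete_one()
            return selected
        for _ in range(n2):
            delete_one()

# Toy criterion (hypothetical): individual weights, with "a" and "b" redundant.
weights = {"a": 3.0, "b": 2.9, "c": 1.0, "d": 0.5}

def toy_criterion(subset):
    score = sum(weights[f] for f in subset)
    if "a" in subset and "b" in subset:
        score -= 2.5  # penalty for carrying two redundant features
    return score

print(addel(list(weights), toy_criterion, size=2))  # ['a', 'c']
```

In practice the criterion would be the recognition reliability or the FRiS-compactness of the subset; the toy function only makes the control flow easy to follow.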

Algorithm AdDel. Reliability (R) of recognition in feature spaces of different dimension.

R(AdDel) > R(DelAd) > R(Ad) > R(Del)

Algorithm GRAD

• AdDel can work with groups of attributes (granules) of different capacity m = 1, 2, 3, …
• The granules can be formed by exhaustive search.
• But: the problem of combinatorial explosion!

Solution: rely on the individual informativeness of attributes

[Figure: frequency f with which an attribute falls into an informative subsystem versus its rank L by individual informativeness]

This makes it possible to granulate only the most informative part of the attributes

Algorithm GRAD (GRanulated AdDel)

1. Independent testing of N attributes
   Selection of the m1 << N first best (m1 granules of power 1)

2. Forming $C_{m_1}^{2}$ combinations
   Selection of the m2 << $C_{m_1}^{2}$ first best (m2 granules of power 2)

3. Forming $C_{m_1}^{3}$ combinations
   Selection of the m3 << $C_{m_1}^{3}$ first best (m3 granules of power 3)

M = <m1, m2, m3> - the set of secondary attributes (granules)

AdDel selects the m* << |M| best granules, which include n* << N attributes

$$X = \langle x_2,\ 3x_6,\ 5x_9,\ x_{25},\ \ldots \rangle$$
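The three granulation steps can be sketched in Python as follows. The helper name `build_granules` and the additive toy criterion are assumptions made for illustration; any subset-quality measure, such as FRiS-compactness, could take the criterion's place.

```python
from itertools import combinations

def build_granules(features, criterion, m1, m2, m3):
    """Form granules of power 1, 2 and 3 from the individually best features."""
    # Step 1: test every attribute on its own and keep the m1 best.
    singles = sorted(features, key=lambda f: criterion((f,)), reverse=True)[:m1]
    # Step 2: all pairs of the m1 best attributes, keep the m2 best pairs.
    pairs = sorted(combinations(singles, 2), key=criterion, reverse=True)[:m2]
    # Step 3: all triples of the m1 best attributes, keep the m3 best triples.
    triples = sorted(combinations(singles, 3), key=criterion, reverse=True)[:m3]
    return [(f,) for f in singles] + pairs + triples

# Hypothetical criterion: a granule is as good as the sum of its weights.
weights = {"a": 4, "b": 3, "c": 2, "d": 1}
granules = build_granules(list(weights), lambda g: sum(weights[f] for f in g),
                          m1=3, m2=2, m3=1)
print(granules)
# [('a',), ('b',), ('c',), ('a', 'b'), ('a', 'c'), ('a', 'b', 'c')]
```

Restricting the combinations to the m1 individually best attributes is what keeps the search away from the combinatorial explosion of all pairs and triples of N attributes.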

Comparison of the criteria (CV - FRiS)

Order of attributes by informativeness: C = 0,661 and C = 0,883

N = 100, M = 2*100, mt = 2*35, mC = 2*65 + noise

[Figure: criteria Fs and U versus noise level (0,05 to 0,3); vertical axis from 0,6 to 1,1]

Some real tasks

Task                                                 K    M        N
Medicine:
  Diagnostics of Diabetes II type                    3    43       5520
  Diagnostics of Prostate Cancer                     4    322      17153
  Recognition of type of Leukemia                    2    38       7129
  Microarray data                                    2    1000     500000
  9 genetic tables                                   2    50-150   2000-12000
Physics:
  Complex analysis of spectra                        7    20-400   1024
Commerce:
  Forecasting of book sales (Data Mining Cup 2009)   -    4812     1862

Recognition of two types of Leukemia - ALL and AML

               Total   ALL   AML
Training set   38      27    11
Control set    34      20    14

N = 7129

I. Guyon, J. Weston, S. Barnhill, V. Vapnik. Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning, 2002, 46(1-3), pp. 389-422.

        Training set 38          Test set 34
N g     Vsuc   Vext    Vmed     Tsuc   Text    Tmed    P
7129    0,95   0,01    0,42     0,85   -0,05   0,42    29
4096    0,82   -0,67   0,30     0,71   -0,77   0,34    24
2048    0,97   0,00    0,51     0,85   -0,21   0,41    29
1024    1,00   0,41    0,66     0,94   -0,02   0,47    32
512     0,97   0,20    0,79     0,88   0,01    0,51    30
256     1,00   0,59    0,79     0,94   0,07    0,62    32
128     1,00   0,56    0,80     0,97   -0,03   0,46    33
64      1,00   0,45    0,76     0,94   0,11    0,51    32
32      1,00   0,45    0,65     0,97   0,00    0,39    33
16      1,00   0,25    0,66     1,00   0,03    0,38    34
8       1,00   0,21    0,66     1,00   0,05    0,49    34
4       0,97   0,01    0,49     0,91   -0,08   0,45    31
2       0,97   -0,02   0,42     0,88   -0,23   0,44    30
1       0,92   -0,19   0,45     0,79   -0,27   0,23    27

Pentium T=3 hours

FRiS      Decision rules (name of gene / weight)     P
0,72656   537/1, 1833/1, 2641/2, 4049/2              34
0,71373   1454/1, 2641/1, 4049/1                     34
0,71208   2641/1, 3264/1, 4049/1                     34
0,71077   435/1, 2641/2, 4049/2, 6800/1              34
0,70993   2266/1, 2641/2, 4049/2                     34
0,70973   2266/1, 2641/2, 2724/1, 4049/2             34
0,70711   2266/1, 2641/2, 3264/1, 4049/2             34
0,70574   2641/2, 3264/1, 4049/2, 4446/1             34
0,70532   435/1, 2641/2, 2895/1, 4049/2              34
0,70243   2641/2, 2724/1, 3862/1, 4049/2             34
          2641/1, 4049/1                             33
          2641/1                                     32

In the first 27 subspaces P = 34/34

Pentium T=15 sec

SVM: I. Guyon, J. Weston, S. Barnhill, V. Vapnik
FRiS: Zagoruiko N., Borisova I., Dyubanov V., Kutnenko O.

Best features     SVM        FRiS
FRE 803, 4846     30 (88%)   33 (97%)
4846              27 (79%)   30 (88%)

Projection of the training set onto features 2641 and 4049

[Figure: scatter plot of the AML and ALL classes]

Comparison with 10 methods

• Jeffery I., Higgins D., Culhane A. Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data. // http://www.biomedcentral.com/1471-2105/7/359

• 9 tasks on microarray data. 10 methods of feature selection. Independent attributes. Selection of the n first (best). Criterion: minimum of errors on CV, 10 times by 50%.

Decision rules: Support Vector Machine (SVM), Between Group Analysis (BGA), Naive Bayes Classification (NBC), K-Nearest Neighbors (KNN).
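The "10 times by 50%" criterion can be read as a repeated random half-split. The sketch below rests on that reading and on further assumptions: the function names are hypothetical, the splits are not stratified, and a 1-nearest-neighbour rule stands in for the decision rules listed above.

```python
import math
import random

def one_nn(train_X, train_y, test_X):
    """Label each test point with the class of its nearest training point."""
    return [train_y[min(range(len(train_X)),
                        key=lambda k: math.dist(train_X[k], x))]
            for x in test_X]

def repeated_holdout_error(X, y, fit_predict, repeats=10, test_share=0.5, seed=0):
    """Mean error over `repeats` random splits with `test_share` held out."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    errors = []
    for _ in range(repeats):
        rng.shuffle(idx)
        cut = int(len(idx) * test_share)
        test, train = idx[:cut], idx[cut:]
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in test])
        wrong = sum(p != y[i] for p, i in zip(preds, test))
        errors.append(wrong / len(test))
    return sum(errors) / len(errors)

# Two well-separated toy classes (made-up data).
X = [(float(i), 0.0) for i in range(10)] + [(100.0 + i, 0.0) for i in range(10)]
y = [0] * 10 + [1] * 10
print(repeated_holdout_error(X, y, one_nn))
```

A feature-selection method would then be scored by the smallest such error reached over the candidate subset sizes n.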

Methods of selection

Methods                                                          Results
Significance analysis of microarrays (SAM)                       42
Analysis of variance (ANOVA)                                     43
Empirical Bayes t-statistic                                      32
Template matching                                                38
maxT                                                             37
Between group analysis (BGA)                                     43
Area under the receiver operating characteristic curve (ROC)     37
Welch t-statistic                                                39
Fold change                                                      47
Rank products                                                    42
FRiS-GRAD                                                        12

Empirical Bayes t-statistic - for a medium-sized set of objects
Area under a ROC curve - for small noise and a large set
Rank products - for large noise and a small set

Results of comparison

Task       N0      m1/m2    max of 4   GRAD
ALL1       12625   95/33    100.0      100.0
ALL2       12625   24/101   78.2       80.8
ALL3       12625   65/35    59.1       73.8
ALL4       12625   26/67    82.1       83.9
Prostate   12625   50/53    90.2       93.1
Myeloma    12625   36/137   82.9       81.4
ALL/AML    7129    47/25    95.9       100.0
DLBCL      7129    58/19    94.3       93.5
Colon      2000    22/40    88.6       89.5
average                     85.7       88.4

Unsettled problems

• Censoring of the training set
• Recognition with boundary
• Stolp+corridor (FRiS+LDR)
• Imputation
• Associations
• Unification of tasks of different types (UC+X)
• Optimization of algorithms
• Implementation of the software system (OTEX 2)
• Applications (medicine, genetics, …)
• …

Conclusion

The FRiS function:

1. Provides an effective measure of similarity, informativeness and compactness
2. Provides unification of methods
3. Provides high quality of decisions

Publications: http://math.nsc.ru/~wwwzag

Thank you!

• Questions, please?

Decision rules

Choosing the standards (stolps)

A stolp is an object that protects its own objects and does not attack the objects of other patterns

Defensive capacity: the similarity of own objects to the stolp should be maximal (a minimum of missed targets)

Tolerance: the similarity of other patterns' objects to the stolp should be minimal (a minimum of false alarms)

Algorithm FRiS-Stolp

[Figure: patterns A and B; objects i and j of A, objects q, s and b of B, distances R1 and R2]

Compact patterns should satisfy two conditions:

Defensive capacity: maximal similarity of own objects to stolp i

$$F(j, i)|b = \frac{R_2 - R_1}{R_2 + R_1}, \qquad DC_i = \frac{1}{M_A} \sum_{j=1}^{M_A} F(j, i)|b$$

Tolerance: maximal difference of other patterns' objects from stolp i

$$T_i = \frac{1}{M_B} \sum_{q=1}^{M_B} F(q, s)|i$$

$$S_i = \frac{1}{2} (DC_i + T_i)$$


Decision rules

Algorithm FRiS-Stolp

Examples of taxonomy by the FRiS-Class algorithm

Comparison of FRiS-Class with other taxonomy algorithms

[Figure: curves for FRiS-Cluster, Kmeans, Forel, Scat and FRiS-Tax; horizontal axis K from 2 to 15, vertical axis from 0,3 to 0,9]
