Cognitive data analysis
Nikolay Zagoruiko
Institute of Mathematics of the Siberian Division of the Russian Academy of Sciences,
Pr. Koptyug 4, 630090 Novosibirsk, Russia, [email protected]
Area of interests: Data Analysis, Pattern Recognition, Empirical Prediction,
Discovery of Regularities, Data Mining, Machine Learning, Knowledge Discovery, Intelligent Data Analysis
Cognitive Calculations
Human-centered approach:
• The person as an object: the study of human cognitive mechanisms. New strategic tasks cannot be solved without rapidly raising the intellectual level of the tools for observation, analysis and control.
• The person as a subject: the user of the analysis results. The complexity of these tools and the nature of their output make the results hard to understand. Under these conditions the person is effectively excluded from the man-machine control system.
Specificity of DM tasks:
• Great volumes of data
• Polytypic attributes
• Quantity of attributes >> number of objects
• Presence of noise and blanks
• Absence of information on distributions and dependences
Ontology of DM
The abundance of methods results from
the absence of a uniform approach
to solving tasks of different types
What can we learn from the person?
What decision rules does the person use? (1967)
[Figure: two panels of sample points marked by *, titled "Recognition" and "Taxonomy"]
1. The person understands the results if the classes are divided by perpendicular planes
[Figure: classes in the (x, y) plane, the boundary X = 0.8Y - 3, and new axes X', Y']
2. The person understands the results if the classes are described by standards
[Figure: classes in the (x, y) plane with objects marked by *, and new axes X', Y']
The unique human ability to recognize hard-to-distinguish patterns is based on the ability
to select informative features.
Does the person pass from one basis to another when solving different classification tasks?
Most likely, people use
some universal psycho-physiological function
Our hypothesis:
The basic function used by the person in classification, recognition, feature selection, etc.,
is a measure of similarity
Functions of Similarity
1) $FS(a,b) = 1 - \sqrt{\sum_{i=1}^{n} (x_i^a - x_i^b)^2}$,
2) $FS(a,b) = 1 - \sum_{i=1}^{n} |x_i^a - x_i^b|$,
3) $FS(a,b) = 1 - \max_i |x_i^a - x_i^b|$,
4) $FS(a,b) = \sum_{i=1}^{n} \frac{\min(x_i^a, x_i^b)}{\max(x_i^a, x_i^b)}$,
5) $FS(a,b) = 1 - e^{\dots}$, ....
Similarity is not an absolute but a relative category
Is object b close to a, or is it distant?
[Figure: objects a and b shown alone, then together with a third object c at different positions]
We should know the answer to the question: in competition with what?
Function of Competitive (Rival) Similarity (FRiS)
[Figure: object z at distance r1 from pattern A and at distance r2 from pattern B; F takes values between -1 and +1]
$F(z, 1|2) = \frac{r_2 - r_1}{r_2 + r_1}$
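The FRiS formula can be illustrated with a small sketch. It assumes Euclidean distance and objects given as coordinate tuples; the function name `fris` and the handling of the degenerate case r1 + r2 = 0 are illustrative choices, not taken from the slides.

```python
import math

def fris(z, a, b):
    """Similarity of z to a in competition with b: (r2 - r1) / (r2 + r1)."""
    r1 = math.dist(z, a)  # distance from z to the "own" object a
    r2 = math.dist(z, b)  # distance from z to the competitor b
    if r1 + r2 == 0.0:
        return 0.0  # assumption: z, a and b coincide, so neither wins
    return (r2 - r1) / (r2 + r1)

# z coincides with a, z coincides with b, z is halfway between them
values = [fris(z, (0.0, 0.0), (1.0, 0.0))
          for z in [(0.0, 0.0), (1.0, 0.0), (0.5, 0.0)]]
```

The value is +1 when z coincides with a, -1 when it coincides with b, and 0 on the boundary where both distances are equal.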
All pattern recognition methods are based on the hypothesis of compactness
Braverman E.M., 1962
Patterns are compact if:
- the number of boundary points is small in comparison with the total number of points;
- the patterns are separated from each other by not too elaborate borders.
Compactness
Similarity between objects of one pattern should be maximal
Similarity between objects of different patterns should be minimal
[Figure: objects i and j of pattern A and object b of pattern B, with distances r1 and r2]
Defensive capacity:
Compact patterns should satisfy the condition of maximal similarity between objects of the same pattern
$D_i = \frac{1}{M_A} \sum_{j=1}^{M_A} F(j, i|b)$, where $F(j, i|b) = \frac{r_2 - r_1}{r_2 + r_1}$
Tolerance:
Compact patterns should satisfy the condition of maximal difference of these objects from the objects of other patterns
[Figure: objects i and j of pattern A, objects q and s of pattern B, and object b, with distances r1 and r2]
$T_i = \frac{1}{M_B} \sum_{q=1}^{M_B} F(q, s|i)$, where $F(q, s|i) = \frac{r_2 - r_1}{r_2 + r_1}$
$C_i = (D_i + T_i)/2$
$C_A = \frac{1}{M_A} \sum_{i=1}^{M_A} C_i$, $C_B = \frac{1}{M_B} \sum_{q=1}^{M_B} C_q$
$C = C_A \cdot C_B$
Selection of the standards (stolps)
Algorithm FRiS-Stolp
$C_i = (D_i + T_i)/2 \to \max$
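The selection criterion can be sketched as follows. This is one reading of the slides, under stated assumptions: Euclidean distance, b taken as the object of B nearest to j, s taken as the nearest other object of B to q, and B containing at least two objects. All function names are hypothetical, and this is not the authors' full FRiS-Stolp procedure.

```python
import math

def fris(r1, r2):
    """(r2 - r1) / (r2 + r1), with 0 for the degenerate case (assumption)."""
    total = r1 + r2
    return 0.0 if total == 0.0 else (r2 - r1) / total

def nearest_dist(p, points):
    return min(math.dist(p, x) for x in points)

def defensive_capacity(i, A, B):
    # D_i: mean over j in A of F(j, i | b), b = object of B nearest to j
    return sum(fris(math.dist(j, i), nearest_dist(j, B)) for j in A) / len(A)

def tolerance(i, B):
    # T_i: mean over q in B of F(q, s | i), s = nearest other object of B
    total = 0.0
    for k, q in enumerate(B):
        others = B[:k] + B[k + 1:]
        total += fris(nearest_dist(q, others), math.dist(q, i))
    return total / len(B)

def select_stolp(A, B):
    """Return the object of A with maximal C_i = (D_i + T_i) / 2 and its score."""
    scores = [(defensive_capacity(i, A, B) + tolerance(i, B)) / 2 for i in A]
    best = max(range(len(A)), key=scores.__getitem__)
    return A[best], scores[best]

A = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
B = [(10.0, 0.0), (11.0, 0.0)]
stolp, score = select_stolp(A, B)
```

In this toy example the middle object of A is chosen: it is on average closest to its own objects while staying far from the objects of B.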
Value of FRiS for points on a plane
Informativeness by Fisher for normal distribution
$I_F = \frac{|\mu_1 - \mu_2|}{\sigma_1^2 + \sigma_2^2}$
Compactness has the same sense and can be used as a criterion of informativeness that is invariant to
the law of distribution and to the ratio of N to M
Selection of features
[Diagram: the initial set of features Xo = <1, 2, 3, ..., j, ..., N> is passed to the engine (GRAD), which proposes a variant of subset X<1,2,…,n>; the criterion (FRiS-compactness) rates each variant as good or bad]
Algorithm GRAD
It is based on a combination of two greedy algorithms: forward and backward search.
At the forward stage the Addition algorithm is used (J.L. Barabash, 1963)
At the backward stage the Deletion algorithm is used (Merill T. and Green O.M., 1963)
$L_D = N + (N-1) + (N-2) + \dots + (n+1) = \sum_{j=0}^{N-n-1} (N - j)$
$L_A = N + (N-1) + (N-2) + \dots + (N-n+1) = \sum_{j=0}^{n-1} (N - j)$
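A quick numeric check of the two sums may help. The sketch below assumes that L_A and L_D count the candidate subsets that Addition and Deletion test when selecting n attributes out of N; that interpretation and the function names are assumptions, not stated on the slide.

```python
def addition_cost(N, n):
    """L_A = N + (N-1) + ... + (N-n+1): tests made while growing from 0 to n."""
    return sum(N - j for j in range(n))

def deletion_cost(N, n):
    """L_D = N + (N-1) + ... + (n+1): tests made while shrinking from N to n."""
    return sum(N - j for j in range(N - n))

la = addition_cost(10, 3)   # 10 + 9 + 8
ld = deletion_cost(10, 3)   # 10 + 9 + ... + 4
```

For small n relative to N the forward search tests far fewer subsets than the backward one, which is the usual practical argument for starting with Addition.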
Algorithm AdDel
To reduce the influence of accumulated errors, a relaxation method is applied.
n1 - number of the most informative attributes added to the subsystem (Add),
n2<n1 - number of the least informative attributes eliminated from the subsystem (Del).
AdDel relaxation method: n steps forward, n/2 steps back
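The relaxation scheme can be sketched as a loop that alternates n1 greedy additions with n2 greedy deletions. The `quality` callable is a hypothetical stand-in for the selection criterion (for example FRiS-compactness), and the stopping rule (stop as soon as `target` attributes are selected) is an assumption made to keep the sketch short.

```python
def ad_del(features, quality, target, n1=2, n2=1):
    """Greedy AdDel sketch: n1 steps forward, n2 < n1 steps back."""
    assert 0 <= n2 < n1
    selected = []
    while True:
        for _ in range(n1):  # Add stage: take the attribute that helps most
            rest = [f for f in features if f not in selected]
            if len(selected) == target or not rest:
                return selected
            selected.append(max(rest, key=lambda f: quality(selected + [f])))
        if len(selected) == target:
            return selected
        for _ in range(n2):  # Del stage: drop the attribute that is missed least
            worst = max(selected,
                        key=lambda f: quality([g for g in selected if g != f]))
            selected.remove(worst)

# Toy criterion (assumption): additive weights instead of a real quality measure
weights = {0: 0.1, 1: 0.9, 2: 0.5, 3: 0.7}
chosen = ad_del(list(weights), lambda s: sum(weights[f] for f in s), target=3)
```

With a real, non-additive criterion the deletion stage can remove an attribute that was added early but became redundant, which is the point of the relaxation.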
Algorithm AdDel. Reliability (R) of recognition in
feature spaces of different dimension.
R(AdDel) > R(DelAd) > R(Ad) > R(Del)
Algorithm GRAD
• AdDel can work with groups of attributes (granules) of different capacity m = 1, 2, 3, …
• The granules can be formed by exhaustive search.
• But: the problem of combinatorial explosion!
Solution: rely on the individual informativeness of attributes
[Figure: frequency f with which an attribute enters an informative subsystem as a function of its rank L by individual informativeness]
This makes it possible to granulate only the most informative part of the attributes
Algorithm GRAD (GRanulated AdDel)
1. Independent testing of N attributes
Selection of the m1 << N best (m1 granules of power 1)
2. Forming $C_{m_1}^2$ combinations
Selection of the m2 << $C_{m_1}^2$ best (m2 granules of power 2)
3. Forming $C_{m_1}^3$ combinations
Selection of the m3 << $C_{m_1}^3$ best (m3 granules of power 3)
M = <m1, m2, m3> - set of secondary attributes (granules)
AdDel selects the m* << |M| best granules, which include n* << N attributes
2 6 9 25,3 ,5 , ,...X x x x x
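The three granulation steps can be sketched with `itertools.combinations`. The `quality` callable, which scores a tuple of attributes, is a hypothetical stand-in for the criterion used on the slides, and the function name `form_granules` is illustrative.

```python
from itertools import combinations

def form_granules(features, quality, m1, m2, m3):
    """Build granules of power 1, 2 and 3 from the m1 individually best attributes."""
    def top(candidates, m):
        return sorted(candidates, key=quality, reverse=True)[:m]

    singles = top([(f,) for f in features], m1)       # step 1: m1 << N
    base = [g[0] for g in singles]
    pairs = top(list(combinations(base, 2)), m2)      # step 2: from C(m1, 2)
    triples = top(list(combinations(base, 3)), m3)    # step 3: from C(m1, 3)
    return singles + pairs + triples

# Toy criterion (assumption): additive weights
weights = [0.1, 0.9, 0.5, 0.7, 0.2, 0.3]
granules = form_granules(range(6), lambda g: sum(weights[f] for f in g),
                         m1=3, m2=2, m3=1)
```

Only C(m1, 2) + C(m1, 3) combinations are formed instead of C(N, 2) + C(N, 3), which is how restricting granulation to the individually best attributes avoids the combinatorial explosion.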
Comparison of the criteria (CV - FRiS)
[Figure: two orderings of attributes by informativeness, with C = 0.661 and C = 0.883]
[Figure: Fs and U as functions of the noise level (0.05 to 0.3), vertical axis from 0.6 to 1.1; N=100, M=2*100, mt=2*35, mC=2*65 + noise]
Some real tasks
Task                                                    K    M        N
Medicine:
  Diagnostics of Diabetes II type                       3    43       5520
  Diagnostics of Prostate Cancer                        4    322      17153
  Recognition of type of Leukemia                       2    38       7129
  Microarray data                                       2    1000     500000
  9 genetic tables                                      2    50-150   2000-12000
Physics:
  Complex analysis of spectra                           7    20-400   1024
Commerce:
  Forecasting of book sales (Data Mining Cup 2009)      -    4812     1862
Recognition of two types of Leukemia - ALL and AML
               Total   ALL   AML
Training set   38      27    11
Control set    34      20    14
N = 7129
I. Guyon, J. Weston, S. Barnhill, V. Vapnik. Gene Selection for Cancer Classification using
Support Vector Machines. Machine Learning, 2002, 46(1-3), pp. 389-422.
           Training set 38          Test set 34
N g     Vsuc    Vext    Vmed    Tsuc    Text    Tmed    P
7129    0.95    0.01    0.42    0.85    -0.05   0.42    29
4096    0.82    -0.67   0.30    0.71    -0.77   0.34    24
2048    0.97    0.00    0.51    0.85    -0.21   0.41    29
1024    1.00    0.41    0.66    0.94    -0.02   0.47    32
512     0.97    0.20    0.79    0.88    0.01    0.51    30
256     1.00    0.59    0.79    0.94    0.07    0.62    32
128     1.00    0.56    0.80    0.97    -0.03   0.46    33
64      1.00    0.45    0.76    0.94    0.11    0.51    32
32      1.00    0.45    0.65    0.97    0.00    0.39    33
16      1.00    0.25    0.66    1.00    0.03    0.38    34
8       1.00    0.21    0.66    1.00    0.05    0.49    34
4       0.97    0.01    0.49    0.91    -0.08   0.45    31
2       0.97    -0.02   0.42    0.88    -0.23   0.44    30
1       0.92    -0.19   0.45    0.79    -0.27   0.23    27
Pentium T=3 hours
FRiS      Decision rules (name of gene / weight)    P
0.72656   537/1, 1833/1, 2641/2, 4049/2             34
0.71373   1454/1, 2641/1, 4049/1                    34
0.71208   2641/1, 3264/1, 4049/1                    34
0.71077   435/1, 2641/2, 4049/2, 6800/1             34
0.70993   2266/1, 2641/2, 4049/2                    34
0.70973   2266/1, 2641/2, 2724/1, 4049/2            34
0.70711   2266/1, 2641/2, 3264/1, 4049/2            34
0.70574   2641/2, 3264/1, 4049/2, 4446/1            34
0.70532   435/1, 2641/2, 2895/1, 4049/2             34
0.70243   2641/2, 2724/1, 3862/1, 4049/2            34
          2641/1, 4049/1                            33
          2641/1                                    32
In the first 27 subspaces P = 34/34
Pentium T=15 sec
SVM: I. Guyon, J. Weston, S. Barnhill, V. Vapnik
FRiS: Zagoruiko N., Borisova I., Dyubanov V., Kutnenko O.
Best features     SVM        FRiS
FRE 803, 4846     30 (88%)   33 (97%)
4846              27 (79%)   30 (88%)
Projection of the training set onto features 2641 and 4049
[Figure: scatter plot of the AML and ALL classes]
Comparison with 10 methods
• Jeffery I., Higgins D., Culhane A. Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data. //
• http://www.biomedcentral.com/1471-2105/7/359
9 tasks on microarray data. 10 methods of feature selection. Independent attributes. Selection of the n first (best). Criterion: minimum of errors on CV, 10 times by 50%.
Decision rules: Support Vector Machine (SVM), Between Group Analysis (BGA),
Naive Bayes Classification (NBC), K-Nearest Neighbors (KNN).
Methods of selection
Method                                                           Result
Significance analysis of microarrays (SAM)                       42
Analysis of variance (ANOVA)                                     43
Empirical Bayes t-statistic                                      32
Template matching                                                38
maxT                                                             37
Between group analysis (BGA)                                     43
Area under the receiver operating characteristic curve (ROC)     37
Welch t-statistic                                                39
Fold change                                                      47
Rank products                                                    42
FRiS-GRAD                                                        12
Empirical Bayes t-statistic: for a middle-sized set of objects
Area under a ROC curve: for small noise and a large set
Rank products: for large noise and a small set
Results of comparison
Task       N0      m1/m2    max of 4   GRAD
ALL1       12625   95/33    100.0      100.0
ALL2       12625   24/101   78.2       80.8
ALL3       12625   65/35    59.1       73.8
ALL4       12625   26/67    82.1       83.9
Prostate   12625   50/53    90.2       93.1
Myeloma    12625   36/137   82.9       81.4
ALL/AML    7129    47/25    95.9       100.0
DLBCL      7129    58/19    94.3       93.5
Colon      2000    22/40    88.6       89.5
average                     85.7       88.4
Unsettled problems
• Censoring of training set
• Recognition with boundary
• Stolp+corridor (FRiS+LDR)
• Imputation
• Associations
• Unification of tasks of different types (UC+X)
• Optimization of algorithms
• Realization of program system (OTEX 2)
• Applications (medicine, genetics,…)
• …..
Conclusion
FRiS-function:
1. Provides an effective measure of similarity, informativeness and compactness
2. Provides unification of methods
3. Provides high quality of decisions
Publications: http://math.nsc.ru/~wwwzag
Thank you!
• Questions, please?
Decision rules
Choosing the standards (stolps)
The stolp is an object which protects its own objects
and does not attack another's objects
Defensive capacity:
Similarity of the objects to a stolp should be maximal: a minimum of missed targets.
Tolerance:
Similarity of the objects to another's objects should be minimal: a minimum of false alarms.
Algorithm FRiS-Stolp
[Figure: stolp i and object j of pattern A, objects q and s of pattern B, and object b, with distances R1 and R2]
Defensive capacity: maximal similarity of objects to stolp i
Tolerance: maximal difference of other's objects from stolp i
Compact patterns should satisfy two conditions:
$F(j,i)|b = \frac{R_2 - R_1}{R_2 + R_1}$
$DC_i = \frac{1}{M_A} \sum_{j=1}^{M_A} F(j,i)|b$
$T_i = \frac{1}{M_B} \sum_{q=1}^{M_B} F(q,s)|i$
$S_i = \frac{1}{2}(DC_i + T_i)$
Decision rules: Algorithm FRiS-Stolp
Examples of taxonomy by the FRiS-Class algorithm
Comparison of FRiS-Class with other taxonomy algorithms
[Figure: curves for FRiS-Cluster, Kmeans, Forel, Scat and FRiS-Tax; horizontal axis K from 2 to 15, vertical axis from 0.3 to 0.9]