Talk "Extraction et gestion des connaissances" (EGC 2012)
Extracting biclusters of similar values using triadic concept analysis
(Extraction de biclusters de valeurs similaires à l'aide de l'analyse de concepts triadiques)
M. Kaytoue, S. O. Kuznetsov, J. Macko, W. Meira Jr. and A. Napoli
Bordeaux, 31 January - 3 February 2012
Extraction et Gestion des Connaissances - EGC 2012
Context
Knowledge Discovery in Databases
Biclustering numerical data
Numerical data and bicluster
Given a numerical dataset (G, M, W, I), i.e. an object/attribute data table, where
G is a set of objects (rows)
M is a set of attributes (columns)
W is a set of values
I ⊆ G × M × W is a relation such that (g, m, w) ∈ I, written m(g) = w, means that object g takes the value w for attribute m (I simply represents the data cells)
a bicluster is a pair (A, B) with A ⊆ G and B ⊆ M, i.e. a rectangle in the data table.
Example
Given a dataset (G, M, W, I) with
G = {g1, g2, g3, g4}
M = {m1, m2, m3, m4, m5}
W = {0, 1, 2, 6, 7, 8, 9}
and e.g. m2(g4) = 9
the bicluster ({g2, g3, g4}, {m3, m4}) can be viewed as the rectangle spanning rows g2 to g4 and columns m3 and m4 (shaded gray on the original slide)
      m1  m2  m3  m4  m5
g1     1   2   2   1   6
g2     2   1   1   0   6
g3     2   2   1   7   6
g4     8   9   2   6   7
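The example can be written down as a small data structure. The following is a minimal Python sketch, not taken from the talk: the names `TABLE` and `bicluster_values` are assumptions introduced for illustration, and the nested dictionary is only one possible encoding of the relation I.

```python
# Data table of the example: TABLE[g][m] is the value m(g).
# The encoding as nested dicts is an illustrative assumption.
TABLE = {
    "g1": {"m1": 1, "m2": 2, "m3": 2, "m4": 1, "m5": 6},
    "g2": {"m1": 2, "m2": 1, "m3": 1, "m4": 0, "m5": 6},
    "g3": {"m1": 2, "m2": 2, "m3": 1, "m4": 7, "m5": 6},
    "g4": {"m1": 8, "m2": 9, "m3": 2, "m4": 6, "m5": 7},
}

# G, M and W are recovered from the table itself.
G = set(TABLE)
M = set(TABLE["g1"])
W = {w for row in TABLE.values() for w in row.values()}


def bicluster_values(table, objects, attributes):
    """Return the rectangle of values selected by the bicluster (objects, attributes)."""
    return {g: {m: table[g][m] for m in attributes} for g in objects}


rectangle = bicluster_values(TABLE, ["g2", "g3", "g4"], ["m3", "m4"])
```

Here `rectangle` holds the six cells of the bicluster ({g2, g3, g4}, {m3, m4}).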
But... a bicluster should reflect
a local phenomenon in the data: “rectangles of values”
connectedness of values: e.g. similar values
overlapping: objects/attributes may belong to several patterns
a partial order, e.g. for algorithmic issues
maximality of rectangles w.r.t. connectedness and ordering
Several types of biclusters
Several applications
Collaborative filtering and recommender systems
Finding web communities
Discovery of association rules in databases
Gene expression analysis, ...
Several algorithms
Iterative Row and Column Clustering Combination
Divide and Conquer / Distribution Parameter Identification
Greedy Iterative Search / Exhaustive Bicluster Enumeration
A difficult problem generally relying on heuristics
S. C. Madeira and A. L. Oliveira. Biclustering Algorithms for Biological Data Analysis: A Survey. In IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2004.
Introducing similarity
A simple similarity relation
w1 ≃θ w2 ⇔ |w1 − w2| ≤ θ, with θ ∈ ℝ and w1, w2 ∈ W
Considered type of biclusters
A bicluster (A, B) is a bicluster of similar values if
mi(gj) ≃θ mk(gl), ∀ gj, gl ∈ A, ∀ mi, mk ∈ B
      m1  m2  m3  m4  m5
g1     1   2   2   1   6
g2     2   1   1   0   6
g3     2   2   1   7   6
g4     8   9   2   6   7
(with θ = 2)
and maximal if no object/attribute can be added
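The two definitions above can be checked mechanically. The sketch below is an illustration under assumptions, not the authors' code: `is_similar` and `is_maximal` are hypothetical helper names. It relies on the fact that the values of a rectangle are pairwise within θ exactly when their maximum and minimum differ by at most θ. The bicluster tested at the end is chosen here for illustration and is not necessarily the one highlighted on the slide.

```python
from itertools import product

# Data table of the running example: TABLE[g][m] is the value m(g).
TABLE = {
    "g1": {"m1": 1, "m2": 2, "m3": 2, "m4": 1, "m5": 6},
    "g2": {"m1": 2, "m2": 1, "m3": 1, "m4": 0, "m5": 6},
    "g3": {"m1": 2, "m2": 2, "m3": 1, "m4": 7, "m5": 6},
    "g4": {"m1": 8, "m2": 9, "m3": 2, "m4": 6, "m5": 7},
}
OBJECTS = set(TABLE)
ATTRIBUTES = set(TABLE["g1"])


def is_similar(A, B, theta):
    """True if all values of the rectangle A x B are pairwise within theta."""
    values = [TABLE[g][m] for g, m in product(A, B)]
    return bool(values) and max(values) - min(values) <= theta


def is_maximal(A, B, theta):
    """True if (A, B) is a bicluster of similar values that cannot be extended.

    Testing single additions is enough: any strictly larger bicluster of
    similar values contains a similar one obtained by adding one element.
    """
    A, B = set(A), set(B)
    if not is_similar(A, B, theta):
        return False
    if any(is_similar(A | {g}, B, theta) for g in OBJECTS - A):
        return False
    return not any(is_similar(A, B | {m}, theta) for m in ATTRIBUTES - B)


candidate = ({"g1", "g2", "g3"}, {"m1", "m2", "m3"})
print(is_maximal(*candidate, theta=2))
```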
J. Besson, C. Robardet, L. De Raedt, J.-F. Boulicaut. Mining Bi-sets in Numerical Data. In KDID 2006: 11-23.
Formal Concept Analysis (Ganter & Wille, 1999)
From a formal context to a concept lattice...
m1 m2 m3
g1 × ×
g2 × ×
g3 × ×
g4 × ×
g5 × × ×
Formal concepts = maximal rectangles
... with interesting properties (and existing algorithms!)
Maximality of concepts as rectangles
Overlapping of concepts
Specialization/generalisation hierarchy
This is exactly what we need for biclustering
Contribution
FCA: an interesting framework for biclustering
Use FCA for a complete, correct and non-redundant extraction of biclusters of similar values with lossless discretization
without a fixed similarity parameter (useful for top-k pattern discovery)
with a given similarity parameter (as in the literature)
Design an algorithm that
is more efficient than its competitors (NBS-Miner and IPS)
can be easily distributed
can handle several constraints (e.g. size) on the fly
A better understanding of closed numerical pattern mining
Outline
1 Formal Concept Analysis (FCA)
2 A first FCA-based biclustering method
3 Algorithm TriMax
4 Experiments
5 Conclusion and perspectives
Formal Concept Analysis (FCA)
In a nutshell...
FCA
A data analysis theory rooted in order and lattice theory that characterizes formal concepts (also known as closed itemsets)
A concept in a formal context
Formal context (G, M, I): objects, attributes, incidence relation
Two derivation operators are used to define formal concepts
A concept is a maximal rectangle of ×, modulo column and row permutations
m1 m2 m3
g1 × ×
g2 × ×
g3 × ×
g4 × ×
g5 × × ×
({g3, g4, g5}, {m2,m3}) is a formal concept
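The two derivation operators can be sketched in a few lines of Python. The context below is a hypothetical one: the rows for g3, g4 and g5 follow from the concept stated above, while the crosses chosen for g1 and g2 are an assumption made only for illustration. The names `intent`, `extent` and `is_concept` are likewise illustrative.

```python
# Hypothetical formal context: each object maps to the attributes it has.
# Rows g1 and g2 are assumed; rows g3 to g5 agree with the stated concept.
CONTEXT = {
    "g1": {"m1", "m3"},
    "g2": {"m1", "m2"},
    "g3": {"m2", "m3"},
    "g4": {"m2", "m3"},
    "g5": {"m1", "m2", "m3"},
}
ATTRIBUTES = {"m1", "m2", "m3"}


def intent(objects):
    """Derivation A': attributes shared by all objects of A."""
    rows = [CONTEXT[g] for g in objects]
    return set.intersection(*rows) if rows else set(ATTRIBUTES)


def extent(attributes):
    """Derivation B': objects having all attributes of B."""
    return {g for g, row in CONTEXT.items() if set(attributes) <= row}


def is_concept(A, B):
    """(A, B) is a formal concept iff A' = B and B' = A."""
    return intent(A) == set(B) and extent(B) == set(A)


print(is_concept({"g3", "g4", "g5"}, {"m2", "m3"}))
```

A pair such as ({g3, g4}, {m2, m3}) fails the test because g5 also has both attributes, so the rectangle is not maximal.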
Triadic Concept Analysis (Lehmann & Wille, 1995)
“Extension” of FCA to ternary relations
An object has an attribute under a given condition
Triadic context (G, M, B, Y): objects, attributes, conditions, incidence relation
Several derivation operators are used to characterize “triadic concepts” as maximal cubes of ×
b1
m1 m2 m3
g1 ×
g2 × ×
g3 × ×
g4 × ×
g5 × ×
b2
m1 m2 m3
g1 × × ×
g2 × ×
g3 × × ×
g4 × ×
g5 × ×
b3
m1 m2 m3
g1 × ×
g2 ×
g3 × × ×
g4 × ×
g5 × × ×
({g3, g4, g5}, {m2, m3}, {b1, b2, b3}) is a triadic concept
A first FCA-based biclustering method
Basic idea
Principle
Start from a numerical dataset
Build a triadic context with the same objects, the same attributes, and a third dimension that is a lossless discretization of the “numerical space”
Extract triadic concepts
We show interesting links between biclusters of similar values and triadic concepts
Discretization method
Interordinal scaling (existing discretization scale)
Let (G, M, W, I) be a numerical dataset (with W the set of data values).
Now consider the set T = {[min(W), w], ∀w ∈ W} ∪ {[w, max(W)], ∀w ∈ W}.
Known fact: T and all its intersections characterize any interval of values on W.
Example
With W = {0, 1, 2, 6, 7, 8, 9}, one has
T = {[0, 0], [0, 1], [0, 2], [0, 6], ..., [1, 9], [2, 9], ..., [9, 9]}
and for example [0, 8] ∩ [2, 9] = [2, 8]
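The scale T is easy to generate. The following Python sketch assumes intervals are encoded as `(low, high)` tuples; the function names `interordinal_scale` and `intersect` are illustrative and do not come from the talk.

```python
def interordinal_scale(values):
    """Build T = {[min(W), w]} ∪ {[w, max(W)]} for all w in W, as (low, high) tuples."""
    low, high = min(values), max(values)
    lower_intervals = {(low, w) for w in values}
    upper_intervals = {(w, high) for w in values}
    return lower_intervals | upper_intervals


def intersect(first, second):
    """Intersection of two closed intervals, or None if they are disjoint."""
    low, high = max(first[0], second[0]), min(first[1], second[1])
    return (low, high) if low <= high else None


W = {0, 1, 2, 6, 7, 8, 9}
T = interordinal_scale(W)
print(sorted(T))
print(intersect((0, 8), (2, 9)))
```

Since [0, 9] belongs to both families, T contains 2·|W| − 1 intervals for this W.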
Building a triadic context
Transformation procedure
From a numerical dataset (G, M, W, I), build a triadic context (G, M, T, Y) such that (g, m, t) ∈ Y ⇔ m(g) ∈ t
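A minimal sketch of this transformation on the running example follows. It assumes the 4 × 5 data table shown earlier in the talk and encodes Y as a set of triples; `TABLE`, `interordinal_scale` and `modus` are hypothetical names chosen for illustration. The `modus` helper returns every interval of T that contains all the values of a rectangle (A, B), which is the third component a triadic concept would pair with (A, B).

```python
# Data table of the running example: TABLE[g][m] is the value m(g).
TABLE = {
    "g1": {"m1": 1, "m2": 2, "m3": 2, "m4": 1, "m5": 6},
    "g2": {"m1": 2, "m2": 1, "m3": 1, "m4": 0, "m5": 6},
    "g3": {"m1": 2, "m2": 2, "m3": 1, "m4": 7, "m5": 6},
    "g4": {"m1": 8, "m2": 9, "m3": 2, "m4": 6, "m5": 7},
}


def interordinal_scale(values):
    """Intervals [min(W), w] and [w, max(W)] for all w, as (low, high) tuples."""
    low, high = min(values), max(values)
    return {(low, w) for w in values} | {(w, high) for w in values}


W = {w for row in TABLE.values() for w in row.values()}
T = interordinal_scale(W)

# (g, m, t) is in Y exactly when the value m(g) lies in the interval t.
Y = {
    (g, m, t)
    for g, row in TABLE.items()
    for m, w in row.items()
    for t in T
    if t[0] <= w <= t[1]
}


def modus(A, B):
    """All intervals t such that every cell of the rectangle A x B is related to t."""
    return {t for t in T if all((g, m, t) in Y for g in A for m in B)}


print(sorted(modus({"g1", "g2", "g3"}, {"m1", "m2", "m3"})))
```

For the rectangle ({g1, g2, g3}, {m1, m2, m3}), whose values are all 1 or 2, the intersection of the returned intervals is [1, 2].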
First contribution
We proved that there is a one-to-one correspondence between
(i) Triadic concepts of the resulting triadic context
(ii) Biclusters of similar values maximal for some θ ≥ 0
Interesting facts
Efficient algorithm for concept extraction (Data-Peeler)
L. Cerf, J. Besson, C. Robardet, J.-F. Boulicaut. Closed patterns meet n-ary relations. In TKDD 3(1), 2009.
This algorithm can handle several constraints
Top-k biclusters: a concept (A, B, C) with high |A|, |B| and |C| corresponds to a bicluster (A, B) that is a large rectangle of close values (by properties of the interordinal scale)
This formalization allows us to design a new algorithm to extract maximal biclusters for a given parameter θ
Algorithm TriMax
Compute all maximal biclusters for a given θ
Principle
Use another (but similar) discretization procedure, based on tolerance blocks, to build the triadic context
Standard algorithms output biclusters of similar values that are not necessarily maximal
We design a new algorithm, TriMax, for that task
TriMax is flexible, uses standard FCA algorithms at its core and is more efficient than its competitors
Finding maximal sets of similar values
≃θ is a tolerance relation
reflexive, symmetric, but not transitive
Blocks of tolerance of W
Maximal sets of pairwise similar values are closed sets
Example with θ = 1
≃1   0  1  2  6  7  8  9
0    ×  ×
1    ×  ×  ×
2       ×  ×
6             ×  ×
7             ×  ×  ×
8                ×  ×  ×
9                   ×  ×
Blocks of tolerance
{0, 1}, {1, 2}, {6, 7}, {7, 8}, {8, 9}
Renamed classes
[0, 1], [1, 2], [6, 7], [7, 8], [8, 9]
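For numbers ordered on a line, the blocks of tolerance can be found with one pass over the sorted values. The sketch below is an assumption about how one might compute them, not the procedure from the talk: each block is a maximal window of sorted values whose extremes differ by at most θ, and a window is kept only if it reaches further right than the previous one. The function name `tolerance_blocks` is illustrative.

```python
def tolerance_blocks(values, theta):
    """Return the maximal sets of pairwise similar values (|w1 - w2| <= theta)."""
    ws = sorted(set(values))
    blocks = []
    end = 0  # exclusive right end of the previous window; it never moves left
    for start in range(len(ws)):
        new_end = max(end, start + 1)
        # Extend the window while the new value stays within theta of ws[start].
        while new_end < len(ws) and ws[new_end] - ws[start] <= theta:
            new_end += 1
        # A window that does not reach further right is contained in the previous one.
        if new_end > end:
            blocks.append(ws[start:new_end])
        end = new_end
    return blocks


print(tolerance_blocks({0, 1, 2, 6, 7, 8, 9}, theta=1))
```

With θ = 1 this reproduces the five blocks listed above; a larger θ gives fewer, wider blocks.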
S. O. Kuznetsov. Galois Connections in Data Analysis: Contributions from the Soviet Era and Modern Russian Research. In Formal Concept Analysis, Foundations and Applications, 2005.
New transformation procedure
Tolerance-block-based scaling
Compute the set C of all blocks of tolerance over W
From the numerical dataset (G, M, W, I), build the triadic context (G, M, C, Z) such that (g, m, c) ∈ Z ⇔ m(g) ∈ c
Actually, we remove “useless information”
θ = 1
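The scaling can be sketched on the running example with θ = 1. This is an illustration under assumptions: the data table is the 4 × 5 example shown earlier in the talk, the blocks are written as `(low, high)` intervals, and `TABLE`, `BLOCKS` and `dyadic_context` are hypothetical names. The step that removes “useless information” is not modelled here.

```python
# Data table of the running example: TABLE[g][m] is the value m(g).
TABLE = {
    "g1": {"m1": 1, "m2": 2, "m3": 2, "m4": 1, "m5": 6},
    "g2": {"m1": 2, "m2": 1, "m3": 1, "m4": 0, "m5": 6},
    "g3": {"m1": 2, "m2": 2, "m3": 1, "m4": 7, "m5": 6},
    "g4": {"m1": 8, "m2": 9, "m3": 2, "m4": 6, "m5": 7},
}

# Blocks of tolerance for theta = 1, written as closed intervals.
BLOCKS = [(0, 1), (1, 2), (6, 7), (7, 8), (8, 9)]

# (g, m, c) is in Z exactly when the value m(g) lies in the block c.
Z = {
    (g, m, c)
    for g, row in TABLE.items()
    for m, w in row.items()
    for c in BLOCKS
    if c[0] <= w <= c[1]
}


def dyadic_context(block):
    """Slice of Z for one block: the (object, attribute) pairs whose value is in it."""
    return {(g, m) for g, m, c in Z if c == block}


print(sorted(dyadic_context((6, 7))))
```

Each slice is an ordinary formal context, so any dyadic FCA algorithm can process it on its own.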
Second contribution
Algorithm TriMax
Any triadic concept corresponds to a bicluster of similar values, but not necessarily a maximal one!
This led us to the algorithm TriMax, which:
Processes each formal context (one for each block of tolerance) with any existing FCA algorithm
Treats any resulting concept as a maximal bicluster candidate and checks maximality with a simple procedure (this may be problematic, but experiments show good behaviour)
Can process each context separately
TriMax allows a complete, correct and non-redundant extraction of all maximal biclusters of similar values for a user-defined similarity parameter θ
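The overall scheme can be illustrated with a deliberately naive Python sketch on the running example. It is not the authors' implementation: concepts are enumerated by brute force over attribute subsets rather than with InClose, maximality is tested directly by trying to add one object or one attribute rather than through the modus, and all names (`TABLE`, `tolerance_blocks`, `concepts`, `trimax_sketch`) are assumptions made for illustration.

```python
from itertools import combinations

# Data table of the running example: TABLE[g][m] is the value m(g).
TABLE = {
    "g1": {"m1": 1, "m2": 2, "m3": 2, "m4": 1, "m5": 6},
    "g2": {"m1": 2, "m2": 1, "m3": 1, "m4": 0, "m5": 6},
    "g3": {"m1": 2, "m2": 2, "m3": 1, "m4": 7, "m5": 6},
    "g4": {"m1": 8, "m2": 9, "m3": 2, "m4": 6, "m5": 7},
}
OBJECTS = sorted(TABLE)
ATTRIBUTES = sorted(TABLE["g1"])


def tolerance_blocks(values, theta):
    """Maximal windows of sorted values within theta, as (low, high) intervals."""
    ws, blocks, end = sorted(set(values)), [], 0
    for start in range(len(ws)):
        new_end = max(end, start + 1)
        while new_end < len(ws) and ws[new_end] - ws[start] <= theta:
            new_end += 1
        if new_end > end:
            blocks.append((ws[start], ws[new_end - 1]))
        end = new_end
    return blocks


def concepts(context):
    """Naive enumeration of the concepts (extent, intent) of a dyadic context."""
    found = set()
    for size in range(1, len(ATTRIBUTES) + 1):
        for attrs in combinations(ATTRIBUTES, size):
            extent = frozenset(g for g in OBJECTS if set(attrs) <= context[g])
            if extent:
                intent = frozenset.intersection(*(frozenset(context[g]) for g in extent))
                found.add((extent, intent))
    return found


def is_similar(A, B, theta):
    """True if all values of the rectangle A x B are pairwise within theta."""
    values = [TABLE[g][m] for g in A for m in B]
    return max(values) - min(values) <= theta


def is_maximal(A, B, theta):
    """Direct maximality test: no single object or attribute can be added."""
    return (all(not is_similar(A | {g}, B, theta) for g in set(OBJECTS) - A)
            and all(not is_similar(A, B | {m}, theta) for m in set(ATTRIBUTES) - B))


def trimax_sketch(theta):
    """One dyadic context per block; keep the concepts that are maximal biclusters."""
    values = {w for row in TABLE.values() for w in row.values()}
    result = set()
    for low, high in tolerance_blocks(values, theta):
        context = {g: {m for m in ATTRIBUTES if low <= TABLE[g][m] <= high}
                   for g in OBJECTS}
        result |= {(A, B) for A, B in concepts(context) if is_maximal(A, B, theta)}
    return result


for A, B in sorted(trimax_sketch(1), key=lambda ab: (sorted(ab[0]), sorted(ab[1]))):
    print(sorted(A), sorted(B))
```

Collecting the results in a set removes candidates that are found in several block contexts, and the loop over blocks has no shared state, which is what makes a per-context distribution possible.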
Experiments
TriMax - settings
Implementation: C++, boost library 1.42
InClose algorithm for processing dyadic contexts
Data: gene expression data of the species Laccaria bicolor
Configuration: Intel CPU 2.54 GHz, 8 GB RAM
TriMax - monitoring aspects
Starting with all 12 attributes, we vary the number of objects and the similarity parameter θ, and monitor:
Number of maximal biclusters of similar values
Execution time (in seconds)
Number of tolerance blocks
Density of the triadic context
Comparison of the number of non-maximal biclusters with the number of maximal biclusters
Execution time profiling of the main procedures of TriMax
TriMax - experimental results
Figure (six panels): number of maximal biclusters; execution time in seconds; number of blocks of tolerance; density of the triadic context; number of generated biclusters; execution time
TriMax bottleneck
Computing the modus is problematic...
TriMax builds a formal context (2D) for each block of tolerance
extracts the concepts (A, B) of each of them
computes the modus C to get the triadic concept (A, B, C) and checks maximality
But...
In many applications, experts have preferences
One can remove a bicluster candidate before modus computation according to some constraints
Example with θ = 33,000, 500 objects, 12 attributes
104,226 maximal biclusters extracted in 16.130 sec
5,332 maximal biclusters in 2.1 sec with at least 10 (at most 40) objects
Comparison
Existing algorithms
Numerical Biset Miner (NBS-Miner) - not scalable
J. Besson, C. Robardet, L. De Raedt, J.-F. Boulicaut. Mining Bi-sets in Numerical Data. In KDID 2006: 11-23.
Interval Pattern Structures (IPS) - less efficient than TriMax
M. Kaytoue, S. O. Kuznetsov, and A. Napoli. Biclustering Numerical Data in Formal Concept Analysis. ICFCA, Springer, 2011.
An example of comparison
Increasing number of objects and all 12 attributes. Results in milliseconds.
Figure (three panels): θ = 0, θ = 700, θ = 10000
Other scenarios show a similar behaviour.
Conclusion and perspectives
Conclusion
Contribution
A better understanding of closed numerical pattern mining within FCA
A formal characterization of a type of bicluster
TriMax for efficient computation
Perspectives
top-k bicluster discovery
n-dimensional numerical datasets
Distributed computation
Constraints (size, mean-square residue, etc.)
Links with Fuzzy FCA