8/3/2019 Copy of Clustering and Similarity Search Over Sequences
CLUSTERING & SIMILARITY SEARCH OVER SEQUENCES
SUBJECT: ADVANCED DATABASE MANAGEMENT SYSTEMS
SUBMITTED BY:
NAME : BINDU N V
REGNO :
BRANCH : 1ST SEM M.TECH(QIP) CS&E
COLLEGE : NMAMIT-NITTE
SUBMITTED TO:
GURURAJ BIJU
ASST.PROF
DEPT OF CS&E
NMAMIT-NITTE
CLUSTERING & SIMILARITY SEARCH OVER SEQUENCES
BY
Name : BINDU N.V
Date : 14-11-2010
TOPICS:
• CLUSTERING
• CLUSTERING ALGORITHM - BIRCH
• SIMILARITY SEARCH OVER SEQUENCES
• ALGORITHM TO FIND SIMILAR SEQUENCES
CLUSTERING
• Clustering is the data mining task of finding clusters in a given set of records
• The goal is to partition the given set of records into groups such that records within a group are similar to each other and records that belong to two different groups are dissimilar
• Each such group is known as a cluster
• Similarity between records is measured by a distance function
• A distance function takes two input records and returns a value that is a measure of their similarity
• The output of a clustering algorithm consists of a summarized representation of each cluster
• The summarized representation depends on the type and shape of the clusters the algorithm computes
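• For example, if we have spherical clusters, we can summarize each cluster by its center C and its radius R, where r_1, ..., r_n are the n records in the cluster:

$$C = \frac{1}{n}\sum_{i=1}^{n} r_i \quad\text{and}\quad R = \sqrt{\frac{\sum_{i=1}^{n} (r_i - C)^2}{n}}$$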
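The center and radius summary of a spherical cluster can be sketched in Python as follows. This is a minimal illustration, assuming the center is the mean of the records and the radius is the root-mean-square distance of the records from that center; the function name `summarize_cluster` and the sample records are hypothetical.

```python
import math

def summarize_cluster(records):
    """Return (center, radius) for a non-empty list of equal-length numeric records.

    Assumption: the radius is the root-mean-square distance of the
    records from the center.
    """
    n = len(records)
    dims = len(records[0])
    # Center: the coordinate-wise mean of all records.
    center = [sum(r[d] for r in records) / n for d in range(dims)]
    # Radius: square root of the mean squared distance to the center.
    mean_squared_distance = sum(math.dist(r, center) ** 2 for r in records) / n
    return center, math.sqrt(mean_squared_distance)

# Four corners of a square with side 2 (made-up sample data).
center, radius = summarize_cluster([(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)])
print(center, radius)
```

For the four sample records, the center is the middle of the square and the radius is the distance from the middle to any corner.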
• Clustering algorithms are of two types
i) partitional :- partitions the data into k groups
ii) hierarchical :- generates a sequence of partitions
A clustering algorithm - BIRCH
• Handles large databases
• Based on two assumptions
i) the number of records is very large
ii) only a limited amount of memory is available
• A user can set two parameters to control the BIRCH algorithm
i) K
- a threshold on the amount of main memory
- determines how many clusters can be maintained in main memory
ii) ε
- a threshold for the radius of a cluster
- controls the number of clusters the algorithm discovers
• If ε is small, the algorithm discovers many small clusters
• If ε is large, the algorithm discovers very few large clusters
• A cluster is compact if its radius is smaller than ε
• BIRCH always maintains K or fewer cluster summaries (Ci, Ri) in main memory
• If it is not possible to maintain them with the given amount of memory, ε is increased as described below
- The algorithm reads records from the database sequentially and processes each record r as given below
1. Compute the distance between record r and each of the existing cluster centers. Let i be the cluster index such that the distance between r and Ci is the smallest.
2. Compute the value of the new radius Ri' of the ith cluster under the assumption that r is inserted into it. If Ri' < ε, then the ith cluster remains compact, and we assign r to the ith cluster by updating its center and setting its radius to Ri'. If Ri' > ε, then the ith cluster would no longer be compact if we inserted r into it. Therefore, we start a new cluster containing only the record r.
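The two steps above can be sketched in Python as follows. This is a simplified illustration, not the complete BIRCH algorithm: it ignores the limit K on the number of cluster summaries, and it treats a new radius exactly equal to the threshold as not compact. The class `ClusterSummary`, the function `insert_record`, and the choice to keep a count, a per-dimension sum, and a sum of squared norms for each cluster are assumptions made so that the center and radius can be updated without storing the records.

```python
import math

class ClusterSummary:
    """Running summary of one cluster (hypothetical helper, not from the slides)."""

    def __init__(self, record):
        self.n = 1
        self.linear_sum = [float(x) for x in record]
        self.square_sum = sum(x * x for x in record)

    @staticmethod
    def _center_and_radius(n, linear_sum, square_sum):
        center = [s / n for s in linear_sum]
        # Mean squared distance to the center = mean squared norm - |center|^2.
        variance = square_sum / n - sum(c * c for c in center)
        return center, math.sqrt(max(variance, 0.0))  # clamp rounding noise

    def center(self):
        return self._center_and_radius(self.n, self.linear_sum, self.square_sum)[0]

    def radius(self):
        return self._center_and_radius(self.n, self.linear_sum, self.square_sum)[1]

    def radius_if_added(self, record):
        """Radius the cluster would have if record were inserted into it."""
        linear_sum = [s + x for s, x in zip(self.linear_sum, record)]
        square_sum = self.square_sum + sum(x * x for x in record)
        return self._center_and_radius(self.n + 1, linear_sum, square_sum)[1]

    def add(self, record):
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, record)]
        self.square_sum += sum(x * x for x in record)
        self.n += 1

def insert_record(clusters, record, epsilon):
    """Process one record against the current list of cluster summaries."""
    if not clusters:
        clusters.append(ClusterSummary(record))
        return
    # Step 1: find the cluster whose center is closest to the record.
    nearest = min(clusters, key=lambda c: math.dist(record, c.center()))
    # Step 2: insert only if the cluster stays compact, else start a new cluster.
    if nearest.radius_if_added(record) < epsilon:
        nearest.add(record)
    else:
        clusters.append(ClusterSummary(record))

clusters = []
for r in [(0.0, 0.0), (0.2, 0.0), (10.0, 10.0), (10.0, 10.2)]:  # made-up records
    insert_record(clusters, r, epsilon=1.0)
print(len(clusters), [c.n for c in clusters])
```

With the sample records and a threshold of 1.0, the two points near the origin share one cluster and the two points near (10, 10) share another.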
• Problem with the second step :-
If we already have the maximum number of cluster summaries, K, we cannot start a new cluster. We have to increase the radius threshold ε in order to merge existing clusters.
SIMILARITY SEARCH OVER SEQUENCES
• A lot of the information stored in databases consists of sequences
• To perform similarity search on these sequences, we assume the following :-
• The user specifies a query sequence and wants to retrieve all data sequences that are similar to the query sequence
• We are interested not only in sequences that exactly match the query sequence but also in those that differ only slightly from it
• A data sequence X is a series of numbers X = <x1, x2, ..., xk>, where k is the length of the sequence
• A subsequence Z = <z1, z2, ..., zj> is obtained from a sequence X by deleting numbers from the front and back of X
• If we have two sequences X and Y of the same length k, we can define the Euclidean norm as the distance between the two sequences as follows:

$$\|X - Y\| = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}$$
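The sequence definitions above can be illustrated with a short Python sketch. It assumes the distance is the standard Euclidean norm (the square root of the sum of squared differences) between two sequences of equal length; the function names `euclidean_distance` and `subsequence` are hypothetical.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two sequences of the same length."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def subsequence(x, drop_front, drop_back):
    """Delete drop_front numbers from the front and drop_back from the back of x."""
    if drop_front < 0 or drop_back < 0 or drop_front + drop_back > len(x):
        raise ValueError("cannot delete that many numbers")
    return x[drop_front:len(x) - drop_back]

x = [1, 2, 3, 4, 5]
z = subsequence(x, 1, 2)                       # drops 1 from the front, 4 and 5 from the back
d = euclidean_distance([1, 2, 3], [1, 2, 5])   # only the last position differs
print(z, d)
```

Note that a subsequence in this sense is always a contiguous run of the original sequence, which is why a slice is enough to express it.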
• Similarity search can be classified into two types
1) complete sequence matching :- the query sequence and the sequences in the database have the same length
2) subsequence matching :- the query sequence is shorter than the sequences in the database
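Subsequence matching can be sketched in Python as a sliding window over one data sequence. This is only an illustration under stated assumptions: a window counts as a match when its Euclidean distance to the query is at most a tolerance called `epsilon` here, and the function name `matching_subsequences` is hypothetical.

```python
import math

def matching_subsequences(data, query, epsilon):
    """Return (offset, distance) for every window of data within epsilon of query."""
    m = len(query)
    matches = []
    # Slide a window of the query's length across the data sequence.
    for start in range(len(data) - m + 1):
        window = data[start:start + m]
        distance = math.dist(window, query)
        if distance <= epsilon:
            matches.append((start, distance))
    return matches

data = [1, 2, 3, 4, 3, 2, 1]   # made-up data sequence
query = [3, 4, 3]              # shorter query sequence
matches = matching_subsequences(data, query, epsilon=0.5)
print(matches)
```

Complete sequence matching is the special case in which the query and the data sequence have the same length, so there is exactly one window to compare.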
An Algorithm to find similar sequences
i) Simple method :- retrieve each data sequence, compute its distance to the query sequence, and decide whether it is similar
Disadvantage :- it retrieves every sequence in the database
ii) High-dimensional indexing method :-
• Each data sequence and the query sequence can be represented as a point in a k-dimensional space
• So we can find the similar sequences by querying a multidimensional index built over the data sequences
• Since we want to retrieve all sequences within distance ε of the query sequence, we do not use a point query
• Instead, we query the index with a hyper-rectangle that has side length 2ε and the query sequence as its center, and we retrieve all sequences that fall within the rectangle
• Retrieved sequences that are actually farther than ε from the query sequence are then discarded
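The difference between the simple method and the hyper-rectangle query can be sketched in Python as follows. No real multidimensional index is used here: the rectangle test is applied with a plain loop, standing in for what an index would answer. The tolerance is called `epsilon`, and the function names and sample data are hypothetical.

```python
import math

def in_hyper_rectangle(point, center, epsilon):
    """True if point lies in the box of side length 2 * epsilon centered on center."""
    return all(abs(p - c) <= epsilon for p, c in zip(point, center))

def linear_scan(sequences, query, epsilon):
    """Simple method: compute the distance to every sequence."""
    return [s for s in sequences if math.dist(s, query) <= epsilon]

def rectangle_search(sequences, query, epsilon):
    """Filter with the hyper-rectangle, then refine with the exact distance."""
    # Stand-in for the index lookup: keep only points inside the box.
    candidates = [s for s in sequences if in_hyper_rectangle(s, query, epsilon)]
    # The box also covers its corners, which are farther than epsilon away.
    return [s for s in candidates if math.dist(s, query) <= epsilon]

sequences = [(1.0, 1.0, 1.0), (1.5, 1.5, 1.5), (1.9, 1.9, 1.9), (5.0, 5.0, 5.0)]
query = (1.0, 1.0, 1.0)
result = rectangle_search(sequences, query, epsilon=1.0)
print(result)
```

In the sample data, (1.9, 1.9, 1.9) falls inside the rectangle but lies farther than the tolerance from the query, so the exact distance check removes it. Both functions return the same answer; the benefit of the rectangle query comes only when an index lets the search skip most of the data.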