Copy of Clustering and Similarity Search Over Sequences


CLUSTERING & SIMILARITY SEARCH OVER SEQUENCES

SUBJECT: ADVANCED DATABASE MANAGEMENT SYSTEMS

SUBMITTED BY:

NAME : BINDU N V

REGNO :

BRANCH : 1ST SEM M.TECH(QIP) CS&E

COLLEGE : NMAMIT-NITTE

SUBMITTED TO:

GURURAJ BIJU

ASST.PROF

DEPT OF CS&E

NMAMIT-NITTE



CLUSTERING & SIMILARITY SEARCH OVER SEQUENCES

BY

Name : BINDU N.V

Date : 14-11-2010



TOPICS:

• CLUSTERING
• CLUSTERING ALGORITHM - BIRCH
• SIMILARITY SEARCH OVER SEQUENCES
• ALGORITHM TO FIND SIMILAR SEQUENCES



CLUSTERING

• Clustering is the data mining task of finding clusters in a given set of records
• The goal is to partition the given set of records into groups such that records within a group are similar to each other and records that belong to two different groups are dissimilar
• Each such group is known as a cluster



• Similarity between records is measured by a distance function
• A distance function takes two input records and returns a value that is a measure of their similarity
• The output of a clustering algorithm consists of a summarized representation of each cluster
• The summarized representation depends on the type and shape of the clusters the algorithm computes



• For example, if we have spherical clusters, we can summarize each cluster by its center C and its radius R as follows:

$$C = \frac{1}{n}\sum_{i=1}^{n} r_i \quad \text{and} \quad R = \sqrt{\frac{\sum_{i=1}^{n} (r_i - C)^2}{n}}$$
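The center and radius summary can be computed directly from the records of a cluster. The following is a minimal sketch, assuming each record is a tuple of numbers, that the radius is the square root of the mean squared distance of the records from the center, and that the function name `cluster_summary` is purely illustrative (it does not come from the slides).

```python
import math


def cluster_summary(records):
    """Return (center, radius) for a non-empty list of equal-length numeric records."""
    n = len(records)
    dims = len(records[0])
    # Center: per-dimension mean of the records.
    center = [sum(r[d] for r in records) / n for d in range(dims)]
    # Radius: square root of the mean squared distance from the center.
    mean_sq_dist = sum(
        sum((r[d] - center[d]) ** 2 for d in range(dims)) for r in records
    ) / n
    return center, math.sqrt(mean_sq_dist)


# Four corners of a square with side length 2: the center is the middle of the square.
center, radius = cluster_summary([(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)])
```

For the four corners of the square, every record lies at squared distance 2 from the center (1, 1), so the radius is the square root of 2.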



• Clustering algorithms are of two types:
i) partitional :- partitions the data into k groups
ii) hierarchical :- generates a sequence of partitions



A clustering algorithm - BIRCH

• Handles large databases
• Based on two assumptions:
i) the number of records is very large
ii) only a limited amount of main memory is available



• A user can set two parameters to control the BIRCH algorithm:
i) k
- a threshold on the amount of main memory
- determines how many cluster summaries can be maintained in main memory
ii) ε
- a threshold for the radius of a cluster
- controls the number of clusters the algorithm discovers



• If ε is small, the algorithm discovers many small clusters
• If ε is large, it discovers very few large clusters
• A cluster is compact if its radius is smaller than ε



• BIRCH always maintains k or fewer cluster summaries (Ci, Ri) in main memory, where Ci is the center and Ri is the radius of the ith cluster
• If this is not possible with the given amount of memory, ε is increased as described below
- The algorithm reads records from the database sequentially and processes each record r as given below



1. Compute the distance between record r and each of the existing cluster centers. Let i be the cluster index such that the distance between r and Ci is the smallest.
2. Compute the value of the new radius Ri' of the ith cluster under the assumption that r is inserted into it. If Ri' < ε, then the ith cluster remains compact, and we assign r to the ith cluster by updating its center and setting its radius to Ri'. If Ri' > ε, then the ith cluster would no longer be compact if we inserted r into it. Therefore, we start a new cluster containing only the record r.



• The problem with the second step is that:-
If we already have the maximum number of cluster summaries, k, there is no memory left for the summary of a new cluster, so we have to increase the radius threshold ε in order to merge existing clusters
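The two processing steps can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the full BIRCH algorithm: the class `ClusterSummary`, the function `birch_pass`, and the choice to keep a count, a per-dimension sum, and a sum of squared norms for each cluster (so that the new radius can be computed without revisiting old records) are all hypothetical. The slides do not say how ε is increased or how clusters are merged, so the sketch only signals that situation with an exception. A radius exactly equal to ε is treated as not compact here.

```python
import math


class ClusterSummary:
    """Running summary of one cluster (hypothetical helper, not from the slides)."""

    def __init__(self, record):
        self.n = 1
        self.linear_sum = list(record)
        self.square_sum = sum(x * x for x in record)

    def center(self):
        return [s / self.n for s in self.linear_sum]

    def radius_with(self, record):
        """Radius the cluster would have if `record` were inserted into it."""
        n = self.n + 1
        linear_sum = [s + x for s, x in zip(self.linear_sum, record)]
        square_sum = self.square_sum + sum(x * x for x in record)
        center_sq = sum((s / n) ** 2 for s in linear_sum)
        # Mean squared distance to the center = mean squared norm - squared norm of center.
        return math.sqrt(max(square_sum / n - center_sq, 0.0))

    def add(self, record):
        self.n += 1
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, record)]
        self.square_sum += sum(x * x for x in record)


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def birch_pass(records, k, eps):
    """Read records sequentially, keeping at most k compact cluster summaries."""
    clusters = []
    for r in records:
        if clusters:
            # Step 1: find the cluster whose center is closest to r.
            nearest = min(clusters, key=lambda c: distance(r, c.center()))
            # Step 2: insert r only if the cluster stays compact.
            if nearest.radius_with(r) < eps:
                nearest.add(r)
                continue
        if len(clusters) == k:
            # Assumption: the heuristic for raising eps and merging is left out.
            raise RuntimeError("k summaries already in memory; eps must be increased")
        clusters.append(ClusterSummary(r))
    return clusters


records = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
clusters = birch_pass(records, k=3, eps=0.5)
```

With these four records, the two points near the origin end up in one cluster and the two points near (5, 5) in another. Calling `birch_pass(records, k=1, eps=0.5)` raises the exception, which corresponds to the situation where ε would have to be increased.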



SIMILARITY SEARCH OVER SEQUENCES

• A lot of the information stored in databases consists of sequences
• To perform similarity search on these sequences, we assume that:-
• The user specifies a query sequence and wants to retrieve all data sequences that are similar to the query sequence
• Here we are interested not only in sequences that match the query sequence exactly but also in those that differ only slightly from the query sequence



• A data sequence X is a series of numbers X = <x1, x2, ..., xk>, where k is the length of the sequence
• A subsequence Z = <z1, z2, ..., zj> is obtained from this series by deleting numbers from the front and back of the sequence X
• If we have two sequences X and Y of the same length k, we can define the Euclidean norm as the distance between the two sequences as follows:

$$\|X - Y\| = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}$$
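The distance between two sequences and the notion of a subsequence can be illustrated with a short sketch. It assumes sequences are plain Python lists of numbers and that the distance is the square root of the sum of squared element-wise differences; the function names `euclidean_distance` and `subsequence` are hypothetical.

```python
import math


def euclidean_distance(x, y):
    """Euclidean norm of X - Y for two sequences of the same length."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def subsequence(x, drop_front, drop_back):
    """Delete drop_front numbers from the front and drop_back numbers from the back of x."""
    return x[drop_front:len(x) - drop_back]


d = euclidean_distance([1, 2, 3], [1, 2, 5])   # only the last element differs, by 2
z = subsequence([1, 2, 3, 4, 5], 1, 2)         # drops 1 from the front, 4 and 5 from the back
```

Here `d` is 2.0 and `z` is `[2, 3]`.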



• Similarity search can be classified into two types:
1) complete sequence matching:- the query sequence and the sequences in the database have the same length
2) subsequence matching:- the query sequence is shorter than the sequences in the database
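One straightforward way to picture subsequence matching is a sliding window: every contiguous window of the data sequence that has the same length as the query is compared with the query. The sketch below is only an illustration under assumptions: the function name `subsequence_matches` and the similarity threshold `eps` are hypothetical, and a window counts as similar when its Euclidean distance to the query is at most `eps`.

```python
import math


def euclidean_distance(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def subsequence_matches(data, query, eps):
    """Return (start_offset, distance) for every window of data within eps of query."""
    m = len(query)
    matches = []
    for start in range(len(data) - m + 1):
        window = data[start:start + m]
        d = euclidean_distance(window, query)
        if d <= eps:
            matches.append((start, d))
    return matches


matches = subsequence_matches([1, 2, 3, 4, 3, 2, 1], [3, 4, 3], eps=0.5)
```

In this example only the window starting at offset 2 matches, with distance 0.0. Complete sequence matching is the special case in which the data sequence and the query have the same length, so there is exactly one window.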



An Algorithm to find similar sequences

i) Simple method:- retrieve each data sequence, compute its distance to the query sequence, and check whether it is similar
Disadvantage:- it retrieves every sequence
ii) High-dimensional indexing method:-
• each data sequence and the query sequence can be represented as a point in a k-dimensional space
• so we can retrieve the similar sequences by querying a multidimensional index built over these points



• Since we want to retrieve all sequences within distance ε of the query sequence, we don't use a point query
• Instead, we query the index with a hyper-rectangle that has side length 2ε and the query sequence as its center, and we retrieve all sequences that fall within the rectangle
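The two methods can be compared in a small sketch. No real index is used here: the list comprehension in `rectangle_candidates` is only a stand-in for the range query that a multidimensional index (for example an R-tree) would answer, and all function names are hypothetical. The final distance check in `rectangle_then_refine` is an added step: the hyper-rectangle contains every point within distance ε of the query, but points near its corners can be further away than ε, so they have to be filtered out afterwards.

```python
import math


def euclidean_distance(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def linear_scan(database, query, eps):
    """Simple method: compute the distance to every sequence in the database."""
    return [s for s in database if euclidean_distance(s, query) <= eps]


def rectangle_candidates(database, query, eps):
    """Stand-in for an index range query: hyper-rectangle with side 2*eps around query."""
    return [s for s in database if all(abs(p - q) <= eps for p, q in zip(s, query))]


def rectangle_then_refine(database, query, eps):
    """Query the rectangle first, then discard candidates further than eps away."""
    candidates = rectangle_candidates(database, query, eps)
    return [s for s in candidates if euclidean_distance(s, query) <= eps]


database = [(1.0, 1.0), (1.4, 1.4), (1.3, 1.0), (3.0, 3.0)]
query = (1.0, 1.0)
candidates = rectangle_candidates(database, query, eps=0.5)
result = rectangle_then_refine(database, query, eps=0.5)
```

In this example the sequence (1.4, 1.4) falls inside the rectangle but is about 0.57 away from the query, so it is a candidate that the refinement step removes. The refined result is the same as the one returned by `linear_scan`.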

