Copy of Clustering and Similarity Search Over Sequences

  • 8/3/2019 Copy of Clustering and Similarity Search Over Sequences

    CLUSTERING & SIMILARITY SEARCH OVER SEQUENCES

    SUBJECT: ADVANCED DATABASE MANAGEMENT SYSTEMS

    SUBMITTED BY:

    NAME : BINDU N V

    REGNO :

    BRANCH : 1ST SEM M.TECH(QIP) CS&E

    COLLEGE : NMAMIT-NITTE

    SUBMITTED TO:

    GURURAJ BIJU

    ASST.PROF

    DEPT OF CS&E

    NMAMIT-NITTE

    CLUSTERING & SIMILARITY SEARCH OVER SEQUENCES

    BY

    Name : BINDU N.V

    Date : 14-11-2010

    TOPICS:

    • CLUSTERING
    • CLUSTERING ALGORITHM: BIRCH
    • SIMILARITY SEARCH OVER SEQUENCES
    • ALGORITHM TO FIND SIMILAR SEQUENCES

    CLUSTERING

    • Clustering is the data mining task of finding clusters in a given set of records.
    • The goal is to partition the records into groups such that records within a group are similar to each other and records that belong to different groups are dissimilar.
    • Each such group is known as a cluster.

    • Similarity between records is measured by a distance function.
    • A distance function takes two input records and returns a value that measures their similarity.
    • The output of a clustering algorithm consists of a summarized representation of each cluster.
    • The form of the summarized representation depends on the type and shape of the clusters the algorithm computes.

    • For example, if we have spherical clusters, we can summarize each cluster of n records r1, ..., rn by its center C and its radius R as follows:

    $$C = \frac{1}{n}\sum_{i=1}^{n} r_i \quad\text{and}\quad R = \sqrt{\frac{\sum_{i=1}^{n} (r_i - C)^2}{n}}$$
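As a minimal sketch of this summary, the following Python function computes a center and radius for a list of numeric records, assuming the center is the mean of the records and the radius is the root-mean-square distance of the records from the center. The function name `summarize_cluster` and the tuple representation of records are illustrative assumptions, not part of the original slides.

```python
import math

def summarize_cluster(records):
    """Return (center, radius) for a non-empty list of equal-length numeric records.

    Assumption: center = mean of the records,
    radius = sqrt(sum of squared distances to the center / n).
    """
    n = len(records)
    dims = len(records[0])
    # Center: per-dimension mean of all records.
    center = [sum(r[d] for r in records) / n for d in range(dims)]
    # Sum of squared Euclidean distances from each record to the center.
    squared = sum(
        sum((r[d] - center[d]) ** 2 for d in range(dims)) for r in records
    )
    radius = math.sqrt(squared / n)
    return center, radius

center, radius = summarize_cluster([(0, 0), (2, 0), (0, 2), (2, 2)])
print(center, radius)
```

For the four corner points of a square with side 2, every record is at the same distance from the center, so the radius equals that common distance.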

    • Clustering algorithms are of two types:
      i) Partitional: partitions the data into k groups.
      ii) Hierarchical: generates a sequence of partitions.

    A clustering algorithm: BIRCH

    • Handles large databases.
    • Based on two assumptions:
      i) The number of records is very large.
      ii) Only a limited amount of main memory is available.

    • A user can set two parameters to control the BIRCH algorithm:
      i) k
         - a threshold on the amount of main memory available
         - determines how many cluster summaries can be maintained in main memory
      ii) ε
         - a threshold on the radius of a cluster
         - controls the number of clusters the algorithm discovers

    • If ε is small, the algorithm discovers many small clusters.
    • If ε is large, the algorithm discovers very few large clusters.
    • A cluster is compact if its radius is smaller than ε.

    • BIRCH always maintains k or fewer cluster summaries (Ci, Ri) in main memory.
    • If the summaries cannot be maintained within the given amount of memory, the radius threshold ε is increased, as described below.
    • The algorithm reads records from the database sequentially and processes each record r as follows:

    1. Compute the distance between record r and each of the existing cluster centers. Let i be the cluster index such that the distance between r and Ci is the smallest.

    2. Compute the value of the new radius Ri' of the ith cluster under the assumption that r is inserted into it. If Ri' < ε, then the ith cluster remains compact, and we assign r to the ith cluster by updating its center and setting its radius to Ri'. If Ri' > ε, then the ith cluster would no longer be compact if we inserted r into it. Therefore, we start a new cluster containing only the record r.
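The two steps above can be sketched in Python as follows. This is a simplified illustration, not the full BIRCH algorithm: it ignores the limit of k summaries, and it assumes each cluster summary keeps a record count, a per-dimension sum, and a sum of squared norms so that the center and radius can be updated without storing the records. The names `ClusterSummary`, `radius_with`, and `insert_record`, and the choice to treat a new radius exactly equal to `eps` as not compact, are assumptions made for this sketch.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

class ClusterSummary:
    """Running summary of one cluster (hypothetical helper for this sketch)."""

    def __init__(self, record):
        self.n = 1
        self.linear_sum = list(record)
        self.square_sum = sum(x * x for x in record)

    def center(self):
        return [s / self.n for s in self.linear_sum]

    def radius_with(self, record):
        """Radius the cluster would have if `record` were inserted."""
        n = self.n + 1
        ls = [s + x for s, x in zip(self.linear_sum, record)]
        ss = self.square_sum + sum(x * x for x in record)
        # Mean squared distance to the center equals ss/n - ||center||^2.
        variance = ss / n - sum((s / n) ** 2 for s in ls)
        return math.sqrt(max(variance, 0.0))  # guard against rounding below 0

    def add(self, record):
        self.n += 1
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, record)]
        self.square_sum += sum(x * x for x in record)

def insert_record(clusters, record, eps):
    """Apply steps 1 and 2 to one record; return the index of the cluster used."""
    if clusters:
        # Step 1: find the cluster whose center is closest to the record.
        i = min(range(len(clusters)),
                key=lambda j: euclidean(clusters[j].center(), record))
        # Step 2: insert only if the cluster would remain compact.
        if clusters[i].radius_with(record) < eps:
            clusters[i].add(record)
            return i
    # Otherwise start a new cluster containing only this record.
    clusters.append(ClusterSummary(record))
    return len(clusters) - 1

clusters = []
for rec in [(0, 0), (0.2, 0), (5, 5), (5.1, 5)]:
    insert_record(clusters, rec, eps=1.0)
print(len(clusters))
```

With a radius threshold of 1.0, the two records near the origin share one cluster and the two records near (5, 5) share another.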

    • The problem with the second step is this: if we already have the maximum number of cluster summaries, k, we cannot start a new cluster, so we have to increase the radius threshold ε in order to merge existing clusters.

    SIMILARITY SEARCH OVER SEQUENCES

    • A lot of the information stored in databases consists of sequences.
    • To perform similarity search on these sequences, we assume the following:
      - The user specifies a query sequence and wants to retrieve all data sequences that are similar to the query sequence.
      - We are interested not only in sequences that match the query sequence exactly but also in those that differ only slightly from it.

    • A data sequence X is a series of numbers X = <x1, x2, ..., xk>, where k is the length of the sequence.
    • A subsequence Z = <z1, z2, ..., zj> is obtained from this series by deleting numbers from the front and back of the sequence X.
    • If we have two sequences X and Y of the same length k, we can define the Euclidean norm as the distance between the two sequences as follows:

    $$\|X - Y\| = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}$$

    • Similarity search can be classified into two types:
      1) Complete sequence matching: the query sequence and the sequences in the database have the same length.
      2) Subsequence matching: the query sequence is shorter than the sequences in the database.
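The difference between the two types can be illustrated with a short Python sketch. It assumes the Euclidean distance between equal-length sequences and a user-chosen distance threshold `eps`; the function names and the sliding-window approach to subsequence matching are illustrative assumptions rather than a method given in the slides.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def complete_match(query, data, eps):
    """Complete sequence matching: both sequences must have the same length."""
    return len(query) == len(data) and euclidean(query, data) <= eps

def subsequence_matches(query, data, eps):
    """Subsequence matching: start offsets of every contiguous window of
    `data` (same length as `query`) that lies within `eps` of `query`."""
    m = len(query)
    return [start for start in range(len(data) - m + 1)
            if euclidean(query, data[start:start + m]) <= eps]

query = [1, 2, 3]
data = [0, 1, 2, 3, 9, 1, 2, 4]
print(subsequence_matches(query, data, eps=1.0))
```

Each window is a subsequence in the sense used here: it is obtained by deleting numbers from the front and back of the data sequence.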

    An Algorithm to Find Similar Sequences

    i) Simple method: retrieve each data sequence, compute its distance to the query sequence, and keep the sequences that are similar.
       Disadvantage: it retrieves every sequence in the database.

    ii) High-dimensional indexing method:
       • Each data sequence and the query sequence can be represented as a point in a k-dimensional space.
       • We can therefore find the similar sequences by querying a multidimensional index built on these points.

    • Since we want to retrieve all sequences within distance ε of the query sequence, we do not use a point query.
    • Instead, we query the index with a hyper-rectangle that has side length 2ε and the query sequence as its center, and we retrieve all sequences that fall within the rectangle.
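The hyper-rectangle query can be sketched in Python as a two-stage search. No real index is used here: a linear scan with a box test stands in for the index range query, purely as an assumption for illustration. Because the corners of the rectangle lie farther from the center than the distance threshold `eps`, the sketch also includes a refinement step that checks the exact Euclidean distance of each candidate. The function names are hypothetical.

```python
import math

def in_hyper_rectangle(point, query, eps):
    """True if `point` lies in the box of side 2*eps centered on `query`."""
    return all(abs(p - q) <= eps for p, q in zip(point, query))

def range_search(sequences, query, eps):
    """Return (candidates, matches) for a distance-eps similarity query."""
    # Filter step: stand-in for the index query with the hyper-rectangle.
    candidates = [s for s in sequences if in_hyper_rectangle(s, query, eps)]
    # Refinement step: keep only candidates truly within distance eps.
    matches = [
        s for s in candidates
        if math.sqrt(sum((a - b) ** 2 for a, b in zip(s, query))) <= eps
    ]
    return candidates, matches

sequences = [(0.5, 0.5), (0.9, 0.9), (3.0, 0.0)]
candidates, matches = range_search(sequences, query=(0.0, 0.0), eps=1.0)
print(candidates, matches)
```

In this example the point (0.9, 0.9) falls inside the rectangle but is farther than 1.0 from the query, so it is a candidate that the refinement step discards.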
