Genomic Repeat Visualisation Using Suffix Arrays

Nava Whiteford

Department of Chemistry, University of Southampton

new@soton.ac.uk

Repeat Visualisation Using Suffix Arrays

• The Analysis

• Artificial Sequences

• Genomic Sequences

• The Algorithm

• Larger Sequences

• Non-genomic sequences

The repeatscore plot

A sliding window is run over the entire sequence to divide it into all substrings of a given length (in this case 2).

ATGCATATA

AT TG GC CA AT TA AT TA

AT occurs 3 times

TG occurs 1 time

GC occurs 1 time

CA occurs 1 time

TA occurs 2 times
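
The windowing and counting step can be sketched in a few lines of Python. This is a minimal illustration rather than the code behind the slides; the function name `count_substrings` is an assumption made for this example.

```python
from collections import Counter

def count_substrings(sequence, k):
    """Count every length-k substring seen by a sliding window of step 1."""
    windows = (sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return Counter(windows)

counts = count_substrings("ATGCATATA", 2)
for substring, n in counts.items():
    print(f"{substring} occurs {n} time(s)")
```

Because the window advances one base at a time, overlapping occurrences are all counted, which is why AT is seen three times in ATGCATATA.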

No. occurrences (r)    No. substrings that occur r times
1                      3
2                      1
3                      1
4                      0
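
The step from per-substring counts to this table is a second round of counting: count how many distinct substrings share each occurrence count. A small sketch, assuming the hypothetical helper name `occurrence_histogram`:

```python
from collections import Counter

def occurrence_histogram(sequence, k):
    """Map r -> number of distinct length-k substrings that occur exactly r times."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return Counter(counts.values())

hist = occurrence_histogram("ATGCATATA", 2)
for r in range(1, 5):
    # A Counter returns 0 for occurrence counts that were never seen.
    print(r, hist[r])
```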

The repeat-score plot

                          Sub-string length
Number of occurrences     1   2   3   4   5
1                         2   3   5   6   5
2                         0   1   1   0   0
3                         1   1   0   0   0
4                         1   0   0   0   0
5                         0   0   0   0   0
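
Repeating the count for each substring length gives one column of the matrix per length. The following is a direct, non-optimised sketch for the example sequence ATGCATATA; `repeat_score_matrix` and its parameters are names chosen for this illustration.

```python
from collections import Counter

def repeat_score_matrix(sequence, max_length, max_occurrences):
    """Rows are occurrence counts 1..max_occurrences, columns are lengths 1..max_length."""
    matrix = [[0] * max_length for _ in range(max_occurrences)]
    for k in range(1, max_length + 1):
        counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
        for r, n in Counter(counts.values()).items():
            if r <= max_occurrences:
                matrix[r - 1][k - 1] = n
    return matrix

matrix = repeat_score_matrix("ATGCATATA", max_length=5, max_occurrences=5)
for r, row in enumerate(matrix, start=1):
    print(r, *row)
```

This version rescans the sequence once per substring length, so it is only practical for short sequences.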

The resulting matrix is then plotted as an image:
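
One simple way to turn such a matrix into an image, using only the standard library, is to write it as a plain-text PGM (portable greymap) file. The slides do not say how values are mapped to colour; the linear grey scale below is an assumption, and `matrix_to_pgm` is a hypothetical helper.

```python
def matrix_to_pgm(matrix):
    """Render a matrix of non-negative counts as an ASCII PGM (P2) greyscale image."""
    height, width = len(matrix), len(matrix[0])
    peak = max(max(row) for row in matrix) or 1  # avoid dividing by zero
    lines = ["P2", f"{width} {height}", "255"]
    for row in matrix:
        # Assumed linear scaling: 0 is black, the largest count is white.
        lines.append(" ".join(str(round(255 * value / peak)) for value in row))
    return "\n".join(lines) + "\n"

# Matrix for ATGCATATA: rows are occurrence counts 1..5, columns are lengths 1..5.
example = [
    [2, 3, 5, 6, 5],
    [0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
image = matrix_to_pgm(example)
print(image)
```

Writing `image` to a file ending in `.pgm` gives something most image viewers can open. For real genomic data a logarithmic scale would probably be more readable, since counts can span many orders of magnitude.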

Repeatscore plots of Artificial Sequences

Small repeats

Reverse strand is also included

Random Sequences
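
For the note that the reverse strand is also included, one plausible reading is that substrings are counted on both the sequence and its reverse complement. The sketch below makes that assumption explicit; the slides do not give the exact procedure, and the helper names are illustrative.

```python
from collections import Counter

# Watson-Crick pairing: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(sequence):
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return sequence.translate(COMPLEMENT)[::-1]

def count_both_strands(sequence, k):
    """Count length-k substrings on the forward strand and on the reverse complement."""
    counts = Counter()
    for strand in (sequence, reverse_complement(sequence)):
        counts.update(strand[i:i + k] for i in range(len(strand) - k + 1))
    return counts

both = count_both_strands("ATGCATATA", 2)
print(reverse_complement("ATGCATATA"), both["AT"])
```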

DNA Sequences

• “The language of life”

• Composed of four different bases A, T, G and C

• Sequences range in size from 2000 bp to 670 billion bp.

Small Genomic Sequences

Lambda Phage

Random Sequence

E. coli

Sequences coding for rRNA

Known inter-genic repeat elements

Repeats in Genomic Sequences

A linear-time algorithm

• The plots shown would take hours to construct using traditional methods.

• The algorithms those methods use do not scale linearly with sequence length.

• It is not feasible to create these plots on large sequences unless more advanced algorithms are used.

The suffix array

• Original string: banana$

All suffixes

• banana$

• anana$

• nana$

• ana$

• na$

• a$

In sorted order

• a$

• ana$

• anana$

• banana$

• na$

• nana$
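
The suffix array itself stores only the start positions of the sorted suffixes, not the suffixes themselves. A minimal sketch that builds it by direct sorting is shown below. This is not a linear-time construction; it is used here only because it is short and easy to check against the banana$ example.

```python
def suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order.

    Naive construction by sorting slices; fine for short strings only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])

text = "banana$"
for start in suffix_array(text):
    print(start, text[start:])
```

The output also contains the one-character suffix `$`, which sorts first because `$` compares lower than any letter; the list above leaves it out.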

Generating the repeatscore plot

a$

ana$

anana$

banana$

na$

nana$
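
In the sorted list, suffixes that begin with the same substring sit next to each other, so the number of occurrences of every length-k substring can be read off in a single pass. The slides do not spell out the procedure; the sketch below is one plausible approach, using the longest common prefix (LCP) between neighbouring suffixes. All function names are assumptions, and the naive sort and LCP computation are not linear-time.

```python
from collections import Counter

def suffix_array(text):
    """Naive suffix array by sorting slices (not linear time)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def lcp_array(text, sa):
    """lcp[j] is the common-prefix length of suffixes sa[j-1] and sa[j]; lcp[0] is 0."""
    def common(a, b):
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        return n
    return [0] + [common(sa[j - 1], sa[j]) for j in range(1, len(sa))]

def histogram_from_suffix_array(sequence, k):
    """Map r -> number of distinct length-k substrings occurring r times."""
    text = sequence + "$"  # sentinel that sorts before every letter
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    hist = Counter()
    run = 0  # size of the current group of suffixes sharing a k-prefix
    for j, start in enumerate(sa):
        if start + k > len(sequence):  # suffix too short to hold a k-mer
            if run:
                hist[run] += 1
            run = 0
        elif run and lcp[j] >= k:  # same k-prefix as the previous suffix
            run += 1
        else:  # a new k-prefix begins
            if run:
                hist[run] += 1
            run = 1
    if run:
        hist[run] += 1
    return dict(hist)

print(histogram_from_suffix_array("ATGCATATA", 2))
```

For ATGCATATA with k = 2 this reproduces the earlier table: three substrings occur once, one occurs twice and one occurs three times.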

Whole human genome

Human Chromosome 18

Arabidopsis thaliana chromosome 1, coding region

Fibonacci derived sequences

Gallus gallus chromosome 20

Application to other sequences

• Analysing writing styles

• Finding plagiarised text

• Any sequence that may contain motif-based, language-like structure.
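
Nothing in the counting step is specific to DNA: the same sliding window works on the characters of any text. As a small illustration, the sketch below counts character substrings in a document made of one sentence repeated 16 times. The helper name and the choice of a single space as separator are assumptions.

```python
from collections import Counter

def count_substrings(text, k):
    """Count every length-k character substring of text."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))

sentence = "The quick brown fox jumped over the lazy dog"
document = " ".join([sentence] * 16)  # assumed separator: one space

counts = count_substrings(document, len(sentence))
print(counts[sentence])
```

The full sentence is counted 16 times, so an inserted repeat of this kind would be expected to show up at high occurrence counts and long substring lengths.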

Shakespeare

Text document containing the text “The quick brown fox jumped over the lazy dog” 16 times.

“On the Economy of Machinery and Manufactures” by Charles Babbage with an artificial repeat inserted 16 times.

Conclusion

• This new visualisation technique can highlight repeat structure in sequences.

• In genomic sequences this may be useful in generating annotation.

• There are applications in other areas worth pursuing.

• Our next step is to allow the repeatscore plot to be easily interrogated by a user in order to better understand the repeat structure.