Part I – Sequence analysis (DNA)bio-mirror.im.ac.cn/mirrors/s-star/downloads/tutorial/t2a.pdf ·...

Preview:

Citation preview

1

Part I – Sequence analysis (DNA) : Bioinformatics Software

Chen XinNational University of Singapore

Bioinformatics software

Its role in research:

n Hypothesis-driven research cycle in biology (From Kitano H. Systems biology: a brief overview. Science 2002, 295:1662-4)

2

Bioinformatics software

n Cyclical refinement of predictive computer models used to define further biological experiments, including the optimization step.

n From Brusic et al. 2001, Efficient discovery of immune response targets by cyclical refinement of QSAR models of peptide binding. J. Mol. Graph. Model. 19:405-11, 467

Bioinformatics software

n By combining computational methods with experimental biology, major discoveries can be made faster and more efficiently.

n Today, every large molecular or systems biology project has a bioinformatics component.

n Use of biological software allows biologists to extend their set of skills for more efficient and more effective analysis of their data, and for planning of experiments.

3

Genetic information

n Genetic information carriern DNA or RNA

n Genetic information carriedn Sequence

n Hence:

Life = Life = f f (Sequence)(Sequence)

New drug discovery

A drug = Target identification -> Lead discovery ->Lead optimization -> animal trial -> clinical trail

Target:Key to disease developmentSpecific to disease developmentSequence, Sample protein, 3D structure …

4

DNA sequence analysisTypes of analysis:n GC contentn Pattern analysisn Translation (Open Reading Frame detection)n Gene findingn Mutationn Primer designn Restriction map n ……

When you have a sequence

n Is it likely to be a gene?n What is the possible expression level?n What is the possible protein product?n Can we get the protein product?n Can we figure out the key residue in the

protein product?n ……

5

GC content

n Stabilityn GC: 3 hydrogen bondsn AT: 2 hydrogen bonds

n Codon preferencen GC rich fragment

à Gene

GC Content

n CpG islandn Resistance to methylationn Associated with genes which are frequently

switched onn Estimate: ½ mammalian gene have CpG

islandn Most mammalian housekeeping genes

have CpG island at 5’ end

6

GC content

n GC Content:n Emboss -> CompSeqn Emboss -> GEECEEn Bioedit

n CpG Island:n http://l25.itba.mi.cnr.it/genebin/wwwcpg.pl (Italy)n Emboss -> CpGReport

Pattern analysis

n Patterns in the sequencen Associated with certain biological

functionn Transcription factor bindingn Transcription startingn Transcription endingn Splicingn ……

7

Gene finding

n A kind of pattern searchn Gene structure

n Promoter, Exon, Intronn Promoter: TATA box (TATAAT)n Exon: Open Reading Frame (ORF)n Intron: Only eukaryotes, have splicing signaln Other motifs

Gene

Picture from the LSM2104 Practical, V.B. LIT

8

Gene finding

n Most of the programs focused on Open reading framen Emboss -> GetORFn Emboss -> ShowORF

n Other important elements:n Matrix binding site: Emboss -> MarScann Promoter region: PromoterInspectorn Splicing sites: GeneSplicer

Gene finding

n Prokaryotesn No intron n Long open reading framen High densityn Easy to detect

n Eukaryotesn Have intron n Combination of short open reading framesn Low densityn Hard to detect

9

Problem 1:

n Is it a gene?n Not sure, but have some confidence

n What is the expression level if it is a gene?n Determined by the promoter and other

upper stream elements

Translation

n Six reading framesn Open reading frame (ORF)

n Start codonn Stop codonn Certain length

n Tools: ShowORF

10

Conceptual translation

5’ AATGGCAATCCGCGTAGACTAGGCA 3’

3’ TTACCGTTAGGCGCATCTGTATCGT 5’

AATAATGGCGGCAATAATCCGCCGCGTCGTAGAAGACTACTAGGCGGCAAAAATGATGGCAGCAATCATCCGCCGCGTAGTAGACGACTAGTAGGCAGCAAAAATGGTGGCAACAATCCTCCGCGGCGTAGTAGACTACTAGGAGGCACA

TTTACTACCGTCGTTAGTAGGCGGCGCATCATCTGCTGTATTATCGTCGT

TTTTACCACCGTTGTTAGGAGGCGCCGCATCATCTGTTGTATCATCGTGTTTATTACCGCCGTTATTAGGCGGCGCAGCATCTTCTGTAGTATCGTCGTT

+1

-2-3

-1

+2+3

Six reading frames

AATAATGGCGGCAATAATCCGCCGCGTCGTAGAAGACTACTAGGCGGCAAN G N P R R L G N G N P R R L G

AAAATGGTGGCAACAATCCTCCGCGGCGTAGTAGACTACTAGGAGGCACAW Q S A * T RW Q S A * T R

TTTACTACCGTCGTTAGTAGGCGGCGCATCATCTGCTGTATTATCGTCGT

TTTTACCACCGTTGTTAGGAGGCGCCGCATCATCTGTTGTATCATCGTGTTTATTACCGCCGTTATTAGGCGGCGCAGCATCTTCTGTAGTATCGTCGTT

+1

-2-3

-1

+3

AAATGATGGCAGCAATCATCCGCCGCGTAGTAGACGACTAGTAGGCAGCAM A I R V D * AM A I R V D * A

+2

11

Problem 2

n What is the possible product of this gene?n It is likely to be ….n This conceptual translation is in open

reading frame ……n Can we get the gene product?

n If expression level high: Directly separaten If expression level low: Clone it

Recom

binant DN

A

12

Primer design

n Design primers only from accurate sequence data

n Restrict your search to regions that best reflect your goals

n Locate candidate primersn Verification of your choice

Primer design

n (primer 1) CTAGTACGATn ATGCCGTAGATC……TCCGATCATGCTAn TACGGCATCTAG……AGGCTAGTACGATn ATGCCGTAG (primer 2)

13

Primer design

n Mispriming areas n Primer length: 18-30 (Usually)n Annealing Temperature (55 - 75 C)n GC content: 35% - 65% (usually)n Avoid regions of secondary structuren 100% complimentarity is not necessary n Avoid self-complimentarity

Primer Design

Online tools:n http://www.hgmp.mrc.ac.uk/GenomeWeb/nuc-

primer.htmln http://www-genome.wi.mit.edu/cgi-

bin/primer/primer3_www.cgin http://www.cybergene.se/primer.html

Software toolsn Omigan Vecter NTI

14

Restriction map

n Restriction enzymen Recognize a patternn Recognition site V.S. Cutting site

n Select restriction enzyme to get a fragment of sequence

n Rebuild the sequence to create or invalidate a restriction site

n Tools: Omiga, remap, bioedit

15

Mutation

n Can be generated by PCRn Primers that not perfectly match

n Frame shift mutationn Insertionn Deletion

n Substitutionn Normaln Silent

16

Mutation

n Test the importancen Mutate suspected important place

n Create a patternn Often silent mutation

n Invalidate a patternn Often silent mutation

n Keep a reading frame

Problem 3

n Can we get the protein product?n Clone it and use a bacteria to express it

n Can we figure out the key residue in the protein product?n Guess the important residuen Mutate the residue to see whether the

activity loses

17

Summary

n Life is determined by nucleotide sequencesn Sequence analysis reveals patterns have

biological significancen Sequence analysis helps the design of wet-lab

experiments

n Next part will be on protein sequence analysis

Recommended