Bioinformatics Lab 2 (Evelyn)

8/10/2019 Bioinformatics Lab 2 (Evelyn)


Bioinformatics Laboratory 2: Sequence Alignment

By the end of this lab, you should:

- be able to find similar proteins to your protein of interest
- have an idea of how to perform sequence alignments using available online tools
- know where to look for further help on using these tools

We need to have some sequences to work with in this lab. In the Biological Databases lab, we learned how to look up sequence information in one of the many databases available online. Look up the protein sequences of the genes of your choice and download them to your home directory as FASTA files.
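A FASTA file is plain text: each record starts with a header line beginning with ">", followed by one or more lines of sequence. If you want to check what you downloaded, a minimal reader might look like the sketch below. The function name and the example record are made up for illustration; this is not a full-featured parser.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up record for illustration only (not a real database entry).
example = """>example_protein hypothetical test record
MKTAYIAKQR
QISFVKSHFS
""".splitlines()

records = list(parse_fasta(example))
for header, sequence in records:
    print(header, len(sequence))
```

To read a downloaded file instead, pass an open file handle, e.g. `with open("my_protein.fasta") as handle: records = list(parse_fasta(handle))`, where the file name is whatever you chose when saving.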

Why not the nucleotide sequence? Well, many different triplet codons translate to the same amino acid, and what ultimately determines the structure, and thus the function, of the protein is its amino acid sequence. So most of the time, it makes more sense to align protein sequences rather than nucleotide sequences. This is not a strict rule: there are definitely times when you want to align nucleotide sequences, e.g. for non-protein-coding sequences, regulatory regions of DNA, etc.
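To see why codon degeneracy matters, here is a small sketch that translates two made-up DNA fragments with the standard genetic code. The two fragments are invented for illustration; they differ at most nucleotide positions yet encode the same peptide.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string codon by codon, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def count_identical_positions(a, b):
    """Count positions at which two equal-length strings carry the same character."""
    return sum(x == y for x, y in zip(a, b))


# Two invented fragments that use different codons for the same amino acids.
dna_a = "ATGCTTTCACGT"
dna_b = "ATGTTGAGCAGA"

protein_a = translate(dna_a)
protein_b = translate(dna_b)
nucleotide_matches = count_identical_positions(dna_a, dna_b)

print(protein_a, protein_b)                 # both print MLSR
print(nucleotide_matches, "of", len(dna_a))  # 5 of 12 nucleotides match
```

At the nucleotide level these fragments look less than half identical, while at the protein level they are the same, which is the situation a protein alignment is designed to catch.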

1. Finding similar sequences using BLAST

First of all, why do you want to find similar sequences? As an example, say you stumbled upon a new protein and somehow got a hold of its sequence. The only problem is that you have no idea what the protein's function is! One way to start figuring out the function using bioinformatics is to find a bunch of other sequences that are similar to your protein. If other people already figured out the functions for that set of proteins, then you can infer that your protein probably has a similar function, purely based on sequence similarity. Obviously, this is not a guarantee that your protein actually has that function (the ultimate proof is through experiments), but it's a pretty good start.

In lecture, we've been talking about aligning two sequences and assigning scores to an alignment such that it will give us information about how similar the two sequences are. We've also looked at the Needleman-Wunsch algorithm and started investigating the basis for the scoring schemes. The Needleman-Wunsch algorithm is relatively easy to implement using dynamic programming. However, its running time and memory use both grow with the product of the two sequence lengths, so it is quite time-consuming and memory-intensive to run on sequences of any significant length.
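As a reminder of how the dynamic programming works, here is a minimal sketch that fills the Needleman-Wunsch score table and returns the optimal global alignment score. The scoring values (match +1, mismatch -1, linear gap penalty -1) are assumptions chosen for illustration, not the scheme used in lecture or by any particular tool, and the traceback that recovers the alignment itself is left out.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
    return score[-1][-1]


print(needleman_wunsch_score("ACGT", "ACGT"))  # 4: four matches
print(needleman_wunsch_score("ACGT", "AGT"))   # 2: three matches and one gap
```

The table has (len(a) + 1) x (len(b) + 1) cells and each cell is filled once, which is where the cost proportional to the product of the sequence lengths comes from.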

Yet, all bioinformatics students learn about this algorithm, because it gives a good intuition of how alignments can be performed and, more importantly, what the scores mean. When you want alignments of "real" sequences, you would probably turn to some of the already available alignment tools. (But the idea behind the scores still holds, so that's why it's very important to understand what alignment scores mean!)

Most of you have probably heard of BLAST (http://www.ncbi.nlm.nih.gov/BLAST/) as a program for finding similar sequences, available online from the NCBI website. BLAST is linked to many of the sequence databases, so when you enter a query sequence, it essentially performs many local alignments for you against the sequences in the database you choose and gives you the set of highest-scoring alignments. Obviously, it's not performing thousands and thousands of Needleman-Wunsch alignments or you would never get a result! In fact, BLAST uses a heuristic approach: it first splits the sequence you enter into a bunch of short "words", scans for these words in the sequence database, and then uses the search results to seed alignments extending from the found words. You can learn a lot more about the actual workings of BLAST in this chapter of the NCBI Handbook (http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.chapter.ch16).
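The word-seeding idea can be sketched in a few lines. This is a deliberately simplified illustration, not BLAST's actual implementation: it only finds exact word matches (real BLAST also considers similar-scoring "neighborhood" words for proteins and then extends each seed), and the word length of 3 and the two sequences are assumptions made up for the example.

```python
from collections import defaultdict


def index_words(sequence, word_length=3):
    """Map every overlapping word in the sequence to its start positions."""
    index = defaultdict(list)
    for start in range(len(sequence) - word_length + 1):
        index[sequence[start:start + word_length]].append(start)
    return index


def find_seeds(query, subject, word_length=3):
    """Return (query_start, subject_start) pairs where a query word occurs in the subject."""
    subject_index = index_words(subject, word_length)
    seeds = []
    for query_start in range(len(query) - word_length + 1):
        word = query[query_start:query_start + word_length]
        for subject_start in subject_index.get(word, []):
            seeds.append((query_start, subject_start))
    return seeds


# Invented sequences for illustration only.
query = "MKTAYIAK"
subject = "GGKTAYGG"

seeds = find_seeds(query, subject)
print(seeds)  # [(1, 2), (2, 3)]
```

Both seeds lie on the same diagonal (subject position minus query position is 1), which hints at a single local alignment around "KTAY". Looking words up in an index is much cheaper than filling a full dynamic programming table for every database sequence.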

Go to the BLAST webpage (http://www.ncbi.nlm.nih.gov/BLAST/).

Having a search tool available doesn't necessarily mean life becomes simple! Look at all the choices you have before you even enter a query sequence! Luckily, the people at NCBI have spent a lot of time writing documentation on BLAST, so check out this table (http://www.ncbi.nlm.nih.gov/blast/producttable.shtml#pstab) to see what the differences between the various BLAST programs are. For now, just focus on the ones for protein queries.

- Since our goal here is to look for sequences similar to the query sequence, and our query sequence is most likely longer than 15 residues, protein blast (blastp) seems like a decent choice. So select protein blast (blastp) from the table (or from the main BLAST page).

- You should now see a page with a form containing many boxes for you to fill in. The main text box labeled 'Enter accession number, gi, or FASTA sequence' is where you put in your query sequence. Not sure what format the sequence should be in? Click on the link next to 'Enter accession number, gi, or FASTA sequence' for an explanation.

- How do we tell BLAST where to look for similar sequences? That's specified by the database we use for the query, chosen in the 'Database' dropdown box. To learn what the different databases are, click on the link next to the 'Database' dropdown. Choosing an appropriate database is an important step, since the results you get depend completely on which database you choose to query. If you only care about finding similar sequences in human, you may want to use the 'Organism' text box to restrict the search to, or exclude, particular organisms rather than searching the entire database, which would search all sequences regardless of organism.

- There are many other settings for blastp, and I suggest clicking on the explanation for each to find out what that setting is for. If you're really itching to do your first alignment, just leave everything at its default value and click on the 'Blast' button.

- Once you submit your request, you'll be directed to a page with your Request ID. Wait a bit to let your request go through the queue. By default, BLASTP will also run a search against the Conserved Domain Database (CDD) and display the results graphically while it performs the BLAST search. The graphic shows protein domains that may be present in your query. Click on the graphic to get more information about the search results. Here's a page with help on CDD (http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml).

- When the server finishes processing your request, it'll show you the set of sequences it found. Congrats! You just performed your first BLAST search! Below where the results show your query and its length is a graphic titled "Distribution of BLAST Hits". This graphic shows you at a glance which matches were significant and where they match up with your query. You should be able to tell from this whether the matches are local or global with respect to your query. The matched sequences are also referred to as "target" sequences. Scroll down a bit and you'll see a big list of whole sequences or sequence fragments that matched your query sequence.

  o Associated with each sequence is a 'score', which is the pairwise alignment score between that sequence and your query sequence.

  o You'll also see something called an 'E-value', which is the 'expect value': the number of alignments scoring at least this well that you would expect to find by chance in a search of this size. It essentially tells you how statistically significant your alignment is (more here: http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=FAQ#expect). The E-value is calculated from the alignment score, the length of the query sequence and the database size. The smaller the E-value, the more significant the match. A rule of thumb is that E-values less than 1e-3 are significant matches.

  o Below the line showing the Expect value are three numbers identified as "Identities", "Positives", and "Gaps". "Identities" gives the number of exact matches between the query and the target sequence over the length of this alignment. A general rule of thumb for percent identity (%id) is that it should be at least 25-30% over an alignment of at least 80-100 amino acids to be able to assert that the sequences are homologous, meaning that they are evolutionarily related. If sequences are homologous, then you have a better case for asserting they share a common function. "Positives" takes into account both exact matches and conservative substitutions (i.e., similar amino acids). "Gaps" refers to the number of gaps in the alignment. Other things being equal, the lower the number of gaps, the better the alignment.

  o By default, BLAST will filter out low-complexity regions (http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#filter), which have biased amino-acid composition (e.g., runs of repeated amino acids), because they will skew the results. In the output, BLAST will display these regions grayed out and in lower case.

Note that the alignment scores are based on the scoring matrix that BLAST used. For protein sequence alignments, this is a very important setting, and how you choose an appropriate matrix depends on many things, including what kind of sequence similarities you want to detect, how long your sequences are, etc. Here's a longer explanation of substitution matrices (http://www.ncbi.nlm.nih.gov/blast/html/sub_matrix.html).
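To get a feel for how the E-value reacts to the alignment score, the sketch below evaluates the commonly quoted raw-score formula E = K * m * n * exp(-lambda * S), where m and n are the query length and the total database length. The values of K and lambda, the lengths, and the scores are placeholders chosen for illustration, and the numbers a real BLAST server reports can differ, for example because it adjusts the lengths it uses.

```python
import math


def expect_value(score, query_length, database_length, k, lam):
    """E = K * m * n * exp(-lambda * S) for a raw alignment score S."""
    return k * query_length * database_length * math.exp(-lam * score)


# Placeholder parameters, for illustration only.
K = 0.1
LAMBDA = 0.3
QUERY_LENGTH = 100
DATABASE_LENGTH = 1_000_000

e_weak = expect_value(40, QUERY_LENGTH, DATABASE_LENGTH, K, LAMBDA)
e_strong = expect_value(60, QUERY_LENGTH, DATABASE_LENGTH, K, LAMBDA)

print(f"score 40: E = {e_weak:.3g}")
print(f"score 60: E = {e_strong:.3g}")
```

Two things are worth noticing: E drops exponentially as the score goes up, and it grows in proportion to the database size, so the same alignment is less significant when found in a larger database.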


[Figures: the three parts of the BLAST results page]

1. Graphic Display

2. Hit List

3. Alignment


Blast Exercise

The gene DCC is deleted in colorectal cancer and is located on human chromosome 18q21.3. It encodes a tumor suppressor protein. Expression of the gene is reduced significantly in most colorectal carcinomas. The protein sequence of human DCC has the RefSeq accession number NP_005206.

- Locate the GenBank record for this protein. Note the length of the amino acid sequence.

- Perform a BLASTP search using this protein as the query and Swiss-Prot as the target database. Limit the search to mammalian species only and use BLOSUM62 as the scoring matrix.

  o The DCC protein from human is most closely related to the DCC protein from what other mammal?

  o What percent identity do they share?

  o What is their percent similarity?

  o What is the length of the alignment? Were both proteins aligned along their entire length?

  o Does the DCC protein contain any low-complexity regions that have been masked out by BLASTP? If so, where?

- Look at the results for the protein with the Swiss-Prot ID P97798.

  o What percent identity does it share with the query?

  o What is the alignment length? Is it a global or local alignment for the query and the target?

- Based on the BLASTP results, can any general observations be made regarding the putative function or cellular role of DCC?

2. Protein Databases and Prediction Servers

But what if the sequence you're interested in just isn't similar enough to any other known sequences, so that sequence alignments alone can't tell you what your protein does? There are a number of databases which contain information about protein families, domains, motifs, etc. Some of these databases can also help you find similar proteins.

Let's first take a look at PROSITE (http://au.expasy.org/prosite/). By pasting in either a query sequence or the accession number of a sequence, you can scan the database to see if your protein contains any of the known protein domains and sites, with the idea that this will help you figure out what the function of your protein is. Depending on what your protein is, you may or may not detect binding sites, phosphorylation sites, etc.

Another thing you can do with your protein sequence is to try to predict its secondary structure using one of several prediction servers: PSI-PRED (http://bioinf.cs.ucl.ac.uk/psipred/), PredictProtein (http://www.embl-heidelberg.de/predictprotein/), and JPRED (http://www.compbio.dundee.ac.uk/~www-jpred/). On most of these sites, you submit your protein sequence along with your email address, and when the results are ready, you get a notification email.

Finally, there are a couple of databases which contain classifications of structural elements in proteins, such as SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/), CATH (http://cathwww.biochem.ucl.ac.uk/latest/index.html), and HOMSTRAD (http://www-cryst.bioc.cam.ac.uk/~homstrad/). These databases try to group families of proteins together by common structural elements, so you can view all the proteins with coiled-coil domains, for example.

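Many PROSITE entries are short patterns written in a compact syntax, for example the N-glycosylation site pattern N-{P}-[ST]-{P} ("N, then anything but P, then S or T, then anything but P"). The sketch below shows roughly what scanning a sequence for such a pattern involves, by converting the pattern to a regular expression. It is a simplified illustration under stated assumptions, not the PROSITE scanning software: it handles only single residues, x, [..], {..} and (n) or (n,m) repeats, and the example sequence is made up.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern into a regular expression string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split an element such as "x(2,4)" into its core "x" and repeat "2,4".
        core, repeat = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element).groups()
        if core == "x":
            part = "."                       # any residue
        elif core.startswith("{"):
            part = "[^" + core[1:-1] + "]"   # any residue except those listed
        else:
            part = core                      # a single residue or a [..] set
        if repeat:
            part += "{" + repeat + "}"
        parts.append(part)
    return "".join(parts)


def scan(sequence, pattern):
    """Return (1-based start, matched text) for every match, overlapping ones included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


# Invented sequence for illustration only.
sequence = "MKNGSAANPSL"
hits = scan(sequence, "N-{P}-[ST]-{P}")
print(hits)  # [(3, 'NGSA')]
```

The second N in the example is followed by P, so it is not reported. Keep in mind that short patterns like this one match by chance quite often, so a hit is a hint to follow up rather than proof of a functional site.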

Answer the following questions:

1. You have a query sequence that is 28 residues long. You BLAST this sequence against a non-redundant protein database with 32576 sequences and a total length of 6887085 amino acids. The best hit is in a sequence that is 375 amino acids long. Looking at the alignment of the best hit, you observe the following (only the high-scoring segment pair is shown):

Query 23 NFSSSQGY 38

Sbjct 65 NFSTSQGV 82

By using the BLOSUM62 matrix:

a) Give the alignment score.

b) For this database search, K = 0.04 and λ = 0.27 are appropriate values. Calculate the expectation value, E, where $E = K \times m \times n \times e^{-\lambda S}$. Here n and m are the length of the query sequence and the total length of the database, and S is the alignment score.

2. Imagine you have sequenced a novel fungal genome. There are 6500 predicted genes in this genome. You do a BLAST search among known fungal genomes using the translations of the predicted genes.

a) Of the 6500 proteins, 5500 have a unique match in other species of fungi (S. cerevisiae, S. pombe, or N. crassa) with a very low E-value. What can you say about these proteins?

b) The remaining 1000 proteins each have a best match with an E-value larger than 10. What can you say about these proteins?

c) For these last 1000 proteins, can you run another BLAST search to verify your conclusion? What else can you try?



3. We do a BLAST search to predict the function of our human query protein on two different internet sites that provide a BLAST search tool. The alignments of the best hits are given above.

a) Which hit is statistically more significant? Explain.

b) What is the reason for the difference between the two BLAST results?

4. We have a short protein segment from chicken:

We do a BLAST search to predict its function. The top 10 hits are given above.

a) Can you predict the function of this protein based on this output?

b) What can you do to improve this search?



BLOSUM62 Matrix

    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C   9  -1  -1  -3   0  -3  -3  -3  -4  -3  -3  -3  -3  -1  -1  -1  -1  -2  -2  -2
S  -1   4   1  -1   1   0   1   0   0   0  -1  -1   0  -1  -2  -2  -2  -2  -2  -3
T  -1   1   5  -1   0  -2   0  -1  -1  -1  -2  -1  -1  -1  -1  -1   0  -2  -2  -2
P  -3  -1  -1   7  -1  -2  -2  -1  -1  -1  -2  -2  -1  -2  -3  -3  -2  -4  -3  -4
A   0   1   0  -1   4   0  -2  -2  -1  -1  -2  -1  -1  -1  -1  -1   0  -2  -2  -3
G  -3   0  -2  -2   0   6   0  -1  -2  -2  -2  -2  -2  -3  -4  -4  -3  -3  -3  -2
N  -3   1   0  -2  -2   0   6   1   0   0   1   0   0  -2  -3  -3  -3  -3  -2  -4
D  -3   0  -1  -1  -2  -1   1   6   2   0  -1  -2  -1  -3  -3  -4  -3  -3  -3  -4
E  -4   0  -1  -1  -1  -2   0   2   5   2   0   0   1  -2  -3  -3  -2  -3  -2  -3
Q  -3   0  -1  -1  -1  -2   0   0   2   5   0   1   1   0  -3  -2  -2  -3  -1  -2
H  -3  -1  -2  -2  -2  -2   1  -1   0   0   8   0  -1  -2  -3  -3  -3  -1   2  -2
R  -3  -1  -1  -2  -1  -2   0  -2   0   1   0   5   2  -1  -3  -2  -3  -3  -2  -3
K  -3   0  -1  -1  -1  -2   0  -1   1   1  -1   2   5  -1  -3  -2  -2  -3  -2  -3
M  -1  -1  -1  -2  -1  -3  -2  -3  -2   0  -2  -1  -1   5   1   2   1   0  -1  -1
I  -1  -2  -1  -3  -1  -4  -3  -3  -3  -3  -3  -3  -3   1   4   2   3   0  -1  -3
L  -1  -2  -1  -3  -1  -4  -3  -4  -3  -2  -3  -2  -2   2   2   4   1   0  -1  -2
V  -1  -2   0  -2   0  -3  -3  -3  -2  -2  -3  -3  -2   1   3   1   4  -1  -1  -3
F  -2  -2  -2  -4  -2  -3  -3  -3  -3  -3  -1  -3  -3   0   0   0  -1   6   3   1
Y  -2  -2  -2  -3  -2  -3  -2  -3  -2  -1   2  -2  -2  -1  -1  -1  -1   3   7   2
W  -2  -3  -2  -4  -3  -2  -4  -4  -3  -2  -2  -3  -3  -1  -3  -2  -3   1   2  11
