
Bioinformatics Workshop 2
Recap & Warm-Up Exercise

Determine whether there is an available Xenopus clone (laevis or tropicalis) for Claudin-2…

Start by retrieving the amino acid sequence for Human claudin-2 from Entrez Gene (see list of useful websites) – then by appropriate use of various BLAST flavours, search parameters and notions of orthology see if you can get to an answer.

Use the scratch-pad.html (first item in list of useful websites) to keep notes, accession numbers, sequences, etc. as you go along.


Answer to: Recap & Warm-Up Exercise

1. Get fasta protein sequence

>gi|9966781|ref|NP_065117.1| claudin 2 [Homo sapiens]
MASLGLQLVGYILGLLGLLGTLVAMLLPSWKTSSYVGASIVTAVGFSKGLWMECATHSTGITQCDIYSTL
LGLPADIQAAQAMMVTSSAISSLACIISVVGMRCTVFCQESRAKDRVAVAGGVFFILGGLLGFIPVAWNL
HGILRDFYSPLVPDSMKFEIGEALYLGIISSLFSLIAGIILCFSCSSQRNRSNYYDAYQAQPLATRSSPR
PGQPPKVKSEFNSYSLTGYV

2a. tBLASTn against ‘est_others’ database + Xenopus laevis

2b. tBLASTn against ‘est_others’ database + Xenopus tropicalis

(this gives us the ESTs in each species which best match our human protein)

3. Get the top EST sequence for each species, and search each in turn against the human proteins: BLASTx against ‘nr’ + Homo sapiens (this is a check for orthologs)

The best laevis EST (CF520733.1) gave top 2 human hits:

gi|6912314|ref|NP_036262.1|  claudin 14 [Homo sapiens] >gi|215...   274    1e-73
gi|9966781|ref|NP_065117.1|  claudin 2 [Homo sapiens] >gi|1568...   197    2e-50

So this EST was probably Xl claudin 14.

The best trop EST (DT398005.1) gave top 2 human hits:

gi|4502875|ref|NP_001297.1|  claudin 3 [Homo sapiens] >gi|1635...   340    3e-93
gi|4502877|ref|NP_001296.1|  claudin 4 [Homo sapiens] >gi|1265...   317    2e-86

So this EST was probably Xt-claudin 3.
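The ortholog check in step 3 is a reciprocal best hit test: the EST only counts as a candidate ortholog if searching it back against human proteins returns the original query as the top hit. A minimal sketch of that decision in Python, using the accessions and bit scores shown above; the function names and the dictionary layout are illustrative assumptions, not part of any BLAST tool.

```python
# Reciprocal best hit check, using the BLASTx results listed on the slide.
# Function names and data layout are hypothetical, for illustration only.

QUERY = "NP_065117.1"  # human claudin 2, the protein we started from

# Top two human hits returned when each best Xenopus EST is searched back
back_search = {
    "CF520733.1": [  # best X. laevis EST
        {"acc": "NP_036262.1", "name": "claudin 14", "bits": 274, "evalue": 1e-73},
        {"acc": "NP_065117.1", "name": "claudin 2", "bits": 197, "evalue": 2e-50},
    ],
    "DT398005.1": [  # best X. tropicalis EST
        {"acc": "NP_001297.1", "name": "claudin 3", "bits": 340, "evalue": 3e-93},
        {"acc": "NP_001296.1", "name": "claudin 4", "bits": 317, "evalue": 2e-86},
    ],
}


def best_hit(hits):
    """Return the hit with the highest bit score."""
    return max(hits, key=lambda hit: hit["bits"])


def is_reciprocal_best_hit(query_acc, hits):
    """True only if the back-search's top hit is the original query."""
    return best_hit(hits)["acc"] == query_acc


for est, hits in back_search.items():
    top = best_hit(hits)
    verdict = "candidate ortholog" if is_reciprocal_best_hit(QUERY, hits) else "not claudin 2"
    print(f"{est}: best human hit is {top['name']} ({verdict})")
```

Neither EST passes the test, which matches the conclusion on the slide that these are probably claudin 14 and claudin 3 rather than claudin 2.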


BLAST Parameters Exercises
5. E-Value maximum for reporting

Open the file example-sequences.html.

Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home Page. Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.

Run the search with the default values.

Now re-run the search:

setting the maximum E-value in the box -> 100
setting the maximum E-value in the box -> 1000
setting the maximum E-value in the box -> 10000

What difference does this make?
Have you found related proteins in your results?
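When thinking about this exercise it may help to see how an E-value relates to a bit score. A minimal sketch, assuming the standard approximation E = m × n × 2^(−S'), where S' is the bit score, m the query length and n the total database length. The bit score and database size below are made-up illustrative numbers, and real BLAST applies further length corrections that are ignored here.

```python
# E-value from bit score: E = m * n * 2**(-S').
# All numbers used below are hypothetical, chosen only to show the scaling.

def expected_hits(bit_score, query_length, database_length):
    """Approximate number of chance alignments scoring at least bit_score."""
    return query_length * database_length * 2.0 ** (-bit_score)


DATABASE_LENGTH = 1_000_000_000  # assumed total residues in the database
BIT_SCORE = 30                   # assumed score of an alignment

for query_length in (10, 100, 300):
    e_value = expected_hits(BIT_SCORE, query_length, DATABASE_LENGTH)
    print(f"query of {query_length:>3} residues: E = {e_value:.1f}")
```

For a fixed bit score the E-value grows with the size of the search space, and each extra bit of score halves it. A very short query can only reach a modest bit score, which is one reason its genuine matches may only be reported once the E-value limit is raised.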


Bioinformatics Workshop 2
Identifying Unknown Genes …

• Open a web browser and type in the URL:
– informatics.gurdon.cam.ac.uk/online/workshops
– bookmark this page

• Click on the link to the file:
– useful-websites.html
– bookmark this page too
– it also contains links to the example sequence files used in the workshop, and the presentations themselves


Part 1: Genome Browsers

Now that most model organisms have had their genomes sequenced, we can get a lot more information about how a gene works than we could by just doing a BLAST search against the protein databases.

Even if ‘your’ favourite genome is still just in ‘scaffolds’ and not yet assembled into chromosomes, we can still add a lot of value.

The main task carried out on a genome before it is released to the user community is annotation. In practice this means adding gene models based on known expressed sequences, both from the same organism and from other fairly closely related ones, and possibly also purely predicted gene models based on sequence composition analysis and ‘features’ like start and stop codons and splice sites. Known mapping markers, SNPs, etc. are then added as well.

With ~3,000,000,000 nucleotides in the genome sequence (human), this presents a considerable challenge to display on a web browser page, which is of course the preferred option. Most genome browsers (software designed to display genome-based data in a web browser) have taken roughly the same approach, which we’ll take a quick look at…


[Figure: genome with gene models, aligned ESTs and aligned cDNA]


Schematic Genome Browser

[Figure: schematic genome browser showing Mus musculus, chromosome 12, positions 24000 to 27000, with navigate and zoom (- +) controls above the genome, and TRACKS below it: Your sequence, Genes, ESTs, conservation (Human, Fish)]


How to Use UCSC Browser


Displaying your own data

You can also use the UCSC browser to display your own data…

Not just your blasted sequence.

Simply create a text file in one of several specified formats, e.g.

------------------------------------------------------------
browser position chr1:1,000,000-1,050,000

track name=track1 visibility=1 description="My display data" itemRgb="On" priority=1

chr1 1006500 1008500 1006500 0 + 1006500 1008500 0,0,255
chr1 1011500 1012750 1011500 0 + 1011500 1012750 0,100,150
chr1 1015250 1016500 1015250 0 + 1015250 1016500 0,100,150
chr1 1018000 1021000 1018000 0 + 1018000 1021000 0,170,80
chr1 1024500 1028000 1024500 0 + 1024500 1028000 80,170,0
::
------------------------------------------------------------

And load via the ‘Genomes’ / ‘manage custom tracks’ facility.

These mechanisms are well documented on the UCSC site.
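If your features come out of a script or a spreadsheet, a file like the one above is easy to generate. A minimal sketch in Python that reproduces the example track; the function names are hypothetical, and the layout simply copies the nine-column lines on the slide (feature name set to the start coordinate, score 0, ‘+’ strand, thick region equal to the whole feature).

```python
# Generate a custom track file in the layout shown on the slide.
# bed_line and custom_track are hypothetical helper names.

def bed_line(chrom, start, end, rgb):
    """Build one nine-field line: name = start, score 0, '+' strand."""
    colour = ",".join(str(channel) for channel in rgb)
    fields = [chrom, start, end, start, 0, "+", start, end, colour]
    return " ".join(str(field) for field in fields)


def custom_track(position, name, description, features):
    """Return the full text of the track: browser line, track line, features."""
    lines = [
        f"browser position {position}",
        f'track name={name} visibility=1 description="{description}" '
        'itemRgb="On" priority=1',
    ]
    lines.extend(bed_line(*feature) for feature in features)
    return "\n".join(lines) + "\n"


features = [
    ("chr1", 1006500, 1008500, (0, 0, 255)),
    ("chr1", 1011500, 1012750, (0, 100, 150)),
    ("chr1", 1015250, 1016500, (0, 100, 150)),
    ("chr1", 1018000, 1021000, (0, 170, 80)),
    ("chr1", 1024500, 1028000, (80, 170, 0)),
]

track_text = custom_track(
    "chr1:1,000,000-1,050,000", "track1", "My display data", features
)
print(track_text, end="")
```

Save the printed text to a plain text file and load it through the custom tracks facility as described above.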


Exercises 1.

Find the web site for the Santa Cruz Genome Browser (sometimes called the Golden Path), and investigate the three genes for which you have the full length cDNA sequence, or the protein sequence, in the file example-sequences.html

>TNeu084i05 (Xenopus)
How many exons does the gene appear to have?
Has it been mapped already?
Are there any likely upstream regulatory elements (look for conservation across species)?
Are there other genes nearby?

>TGas122d03 (Xenopus)
Is this a relatively unique gene, or a member of a gene family?
What can we learn from the comparison with human genes?
Are there any differences between the gene model predicted from your cDNA, and the existing predictions?

>hsp70-5 (human)
Starts with the protein sequence. How might this be better?


Exercise 1. Results >TNeu084i05


Exercises 2.

Now go to the two other main genome browsers, Ensembl and NCBI. Find the Xenopus genome (at the moment you won’t find it at NCBI, so use the mouse genome instead), and see if you get the same sort of functionality from them. Use the same two sequences.
Are there different features?
Are they easier/harder to use?


Part 2: Identifying Novel Proteins

[Figure: FUNCTIONAL ANNOTATION by BLAST. A sequence to analyse (what is its function?) is compared with a database of proteins in other species. Database entries shown: Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wint8, Troponin T3. Label shown with the sequence: Gravin-like]


Different Possible Outcomes

Suppose you have a cDNA sequence and you run BLASTx:

1 - genes of identifiably same function in several different species: KNOWN
4e-014 - polyunsaturated fatty acid elongase [Xenopus laevis]
7e-140 - fatty acid elongase 2 [Rattus norvegicus]
1e-140 - ELOVL6 protein [Homo sapiens]

2 - genes of unknown function in several different species: NOVEL
2e-103 - unnamed protein product [Tetraodon nigroviridis]
3e-115 - 2310009N05Rik protein [Mus musculus]
5e-117 - hypothetical protein FLJ22378 [Homo sapiens]

3 - genes with no significant BLASTx hits in other species: ORPHAN
7.3 - 1-deoxy-D-xylulose 5-phosphate synthase [Chlamydophila abortus]
4.7 - PREDICTED: similar to tweety 2 isoform 1 [Bos taurus]

4 - significant BLASTx hits in phylogenetically distant species: OUCH..!
2e-200 - coat maintenance protein [Escherichia coli]


Different Ways not to Know Anything

Your lack of knowledge about protein function, having directly compared your sequence with all known proteins in the database, will manifest itself in two rather different ways.

1. It looks like a NOVEL gene – we find plenty of evidence for orthologous genes, but their annotations are just different ways of saying that nothing is known about their function either.

2. It looks like an ORPHAN gene – this is a sign that this protein may only exist in your organism. The phenomenon is quite well documented (see reference). Obviously these are going to be quite tough to work on, as nothing like them has been seen before…

Special case. There are good BLASTx matches with phylogenetically DISTANT organisms – check for contamination!

Domazet-Loso T, Tautz D. An Evolutionary Analysis of Orphan Genes in Drosophila. Genome Res. 2003 Oct; 13(10): 2213-2219.
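The outcomes described above (known function, NOVEL, ORPHAN, and distant hits that suggest contamination) can be sketched as a crude first-pass triage of a BLASTx hit list. This is only an illustration: the E-value cut-off, the list of uninformative keywords and the list of distant taxa are all assumptions you would need to choose for your own organism, and the function name is hypothetical.

```python
# Crude triage of BLASTx hits into the four outcomes from the slides.
# Thresholds, keywords and taxa below are assumptions, not fixed rules.

UNINFORMATIVE = ("unnamed protein product", "hypothetical protein",
                 "predicted", "rik protein")


def classify_hits(hits, max_evalue=1e-10, distant_taxa=("Escherichia coli",)):
    """hits: list of (evalue, description, species) tuples."""
    significant = [hit for hit in hits if hit[0] <= max_evalue]
    if not significant:
        return "ORPHAN"
    if all(species in distant_taxa for _, _, species in significant):
        return "CHECK FOR CONTAMINATION"
    if all(any(word in description.lower() for word in UNINFORMATIVE)
           for _, description, _ in significant):
        return "NOVEL"
    return "KNOWN"


examples = {
    "case 1": [(4e-14, "polyunsaturated fatty acid elongase", "Xenopus laevis"),
               (7e-140, "fatty acid elongase 2", "Rattus norvegicus"),
               (1e-140, "ELOVL6 protein", "Homo sapiens")],
    "case 2": [(2e-103, "unnamed protein product", "Tetraodon nigroviridis"),
               (3e-115, "2310009N05Rik protein", "Mus musculus"),
               (5e-117, "hypothetical protein FLJ22378", "Homo sapiens")],
    "case 3": [(7.3, "1-deoxy-D-xylulose 5-phosphate synthase", "Chlamydophila abortus"),
               (4.7, "PREDICTED: similar to tweety 2 isoform 1", "Bos taurus")],
    "case 4": [(2e-200, "coat maintenance protein", "Escherichia coli")],
}

for label, hits in examples.items():
    print(label, "->", classify_hits(hits))
```

Keyword matching on descriptions is fragile, so treat the output as a prompt to look at the hits yourself rather than as an answer.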


Indirect Functional Identification

So you’ve found a gene you’re interested in, you’ve blasted it against the biggest protein database you can find, and you still have no real clues as to what its function might be. What do you do next…

(make sure you really have a gene on your hands)

1. LOOK FOR MORE DISTANTLY RELATED GENES WITH ANNOTATION
If there are believable BLASTx matches, but they are all predicted genes with no functional annotation, it might still be possible to use them as stepping stones to other, more informative, BLASTx matches which would not show up as similar to the original sequence. Think of this as traversing the phylogenetic tree.

2. FIND PARTIAL OR INDIRECT DATA – DOMAINS, EXPRESSION, ETC.
Accumulate as much partial data about the sequence as you can, in the hope that it sheds light on the function. This will include functional protein domains, expression data, genomic alignment and secondary structure. It’s unlikely that you will become casually involved with higher order structures, as solving or comparing these is a complex and specialised task.


Phylogenetic Stepping Stones

Consider a gene which has the same function across many phyla, and suppose we consider a phylogenetic tree based on sequence similarity:

[Figure: phylogenetic tree with leaves: your species, species D, species C, species B, species E – function known]

It’s possible that the sequence of the gene in your species is sufficiently similar to its orthologs in species B and C that these will show up in a BLAST search, while those in species D and E will not. But the sequence of the gene in species C is more similar to those in D or E. So once you get to C and BLAST from there, you might get to E, which happens to have been researched and whose function is known.

This could be done manually, but it has been formalised in PSI BLAST, which uses iterative rounds of BLAST searching to build a more generalised model of the gene sequence, and uses this ‘evolving’ model to gradually traverse the tree. If not used carefully, though, it can go horribly wrong…


PSI BLAST
(Position Specific Iterated – since you asked)

Initial Query
SREFTHYQWERLIKKTYFARFHNCMLISFSWER

Matches from database
SREKTSYQAERLIIWERFARFHICMLIPQSWER
SREKDSYQUERLIPWTYFARFHNCMLIPKSWER

New Composite Query
SREFTHYQWERLIKKTYFARFHNCMLISFSWER
   K S  A    IWER     I    PQ
    D   U    P T            K

2nd Round Matches from database
SREKTSYQAERLIIWERFARFHICMLIPQSWER
SREKDSYQUERLIPWTYFARFHNCMLIPKSWER
PRAKDTRQIQRLSYWTTFLLFVITSLQRKITER
PRAKDTRQIQRLSYWTTFLLFVITSLQRKITER

And so on…
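The composite query on this slide is just the initial query plus, at each position, the alternative residues seen in the first-round matches. A minimal sketch in Python that collects those alternatives from the three toy sequences above. The function names are hypothetical, and real PSI BLAST goes further by turning weighted residue counts into a position specific scoring matrix, which is not attempted here.

```python
# Collect per-position alternatives from pre-aligned, equal-length matches.
# Toy illustration of the 'composite query' idea, not PSI BLAST itself.
from collections import Counter


def column_counts(aligned):
    """Count the residues observed in each column of an ungapped alignment."""
    if len({len(sequence) for sequence in aligned}) != 1:
        raise ValueError("sequences must all have the same length")
    return [Counter(column) for column in zip(*aligned)]


def variants(query, matches):
    """For each query position, list match residues that differ from the query."""
    if any(len(match) != len(query) for match in matches):
        raise ValueError("matches must be the same length as the query")
    result = []
    for query_residue, column in zip(query, zip(*matches)):
        seen = []
        for residue in column:
            if residue != query_residue and residue not in seen:
                seen.append(residue)
        result.append(seen)
    return result


query = "SREFTHYQWERLIKKTYFARFHNCMLISFSWER"
matches = [
    "SREKTSYQAERLIIWERFARFHICMLIPQSWER",
    "SREKDSYQUERLIPWTYFARFHNCMLIPKSWER",
]

for position, (residue, alternatives) in enumerate(
        zip(query, variants(query, matches)), start=1):
    if alternatives:
        print(f"position {position:>2}: {residue} or {'/'.join(alternatives)}")
```

Each new round would add its matches to the list, which is exactly how a poorly chosen match can pull the model away from the original sequence.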


PSI BLAST Round 1 results


PSI BLAST Round 2 results


PSI BLAST Round 3 results


PSI BLAST Round 4 results


Finally some function!


Functional Domain Analysis

Proteins are considered to have functional domains within them: specific regions of the protein which have specific tasks. These domains are recognisably conserved between different proteins, even though the overall similarity of the proteins may be quite low.

Typical Diagram of Functional Domains on a Protein


Functional Domain

If you can find functional domains, you may know something about the general behaviour of your protein, even if you don’t know exactly what its function is. But, as usual, be aware that non-significant matches are quite likely to be displayed by any analysis website, so at least look for a confidence score or other measure of significance, and treat everything with a degree of caution.

The main specialised sites for this type of analysis are SMART and Pfam, which have considerable overlapping functionality. There is also InterProScan, which attempts to integrate all the available tools…

The search methods are rather different from BLAST, and rely primarily on building up a model of the functional domain from known examples. The model is then a generalised pattern for a given domain, and your unknown sequences are searched against the models, using rather more advanced methods, typically involving hidden Markov models.


Functional Domains and Hidden Markov Models

Once a functional domain has been identified in a number of sequences, we can build a model of it, by which we just mean a summary of our understanding of the linear sequence variants.

position  1234567890
          YSCMVGHEAL
          FSCVVGHEAL
          YTCKVDHETL
          FTCQVTHEGD
          YSCRVKHVTL
          YTCVVGHEAL

position  1    2    3    4    5    6    7    8    9    0
model     YF   ST   C    ?    V    ?    H    ~E   ?    ~L
score     5    5    10        10        10   8         8

The scores may be arbitrary, but they constitute the Hidden Markov Model by which we evaluate other proteins to see if they contain this domain. As you accumulate more examples the model gets more refined, and hopefully more accurate…
The higher the score of your test protein sequence against the model, the more likely it is presumed to contain the domain.
The model will also allow for the possibility of (expensive) gaps if the spacing of your real sequence doesn’t fit the model. Known variable regions can be modelled as cheaper gaps.
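To make the idea of scoring a sequence against a model concrete, here is a minimal sketch in Python built from the six example sequences on this slide. It is deliberately simpler than a real hidden Markov model: it only records how often each residue occurs at each position and allows no gaps, and the scoring scheme (summing the observed fractions) is an assumption chosen for illustration rather than the scores shown on the slide.

```python
# A simple position-specific profile built from the slide's six examples.
# This is a simplification: no gaps, no transition states, arbitrary scoring.
from collections import Counter

EXAMPLES = [
    "YSCMVGHEAL",
    "FSCVVGHEAL",
    "YTCKVDHETL",
    "FTCQVTHEGD",
    "YSCRVKHVTL",
    "YTCVVGHEAL",
]


def build_profile(examples):
    """Count residues per column of the ungapped example alignment."""
    return [Counter(column) for column in zip(*examples)]


def score(sequence, profile, n_examples):
    """Sum, over positions, the fraction of examples sharing that residue."""
    if len(sequence) != len(profile):
        raise ValueError("sequence must have the same length as the profile")
    return sum(column[residue] / n_examples
               for residue, column in zip(sequence, profile))


profile = build_profile(EXAMPLES)

for candidate in ("YSCMVGHEAL", "FTCQVTHEGD", "AAAAAAAAAA"):
    print(candidate, round(score(candidate, profile, len(EXAMPLES)), 2))
```

Sequences that match the fully conserved positions (C, V and H in this example) score well even if they differ elsewhere, which is the behaviour the model on the slide is meant to capture.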


Problems with Models by Example

There are two conceptual problems with building models from examples.

First, the behaviour of the protein domain is likely to be related to the three-dimensional shape of the molecule and the nature of its interactions with other molecules. As we are not taking these into account at all, we cannot expect our model to be very realistic.

Secondly, the model is (by its nature) highly biased towards the examples already found, and further examples found with the help of the model will tend to reinforce any initial bias. So our model may tend to grow away from the actual consensus across all possible proteins, and lock us out of whole subsets of data.

Incidentally, this problem of bias is very similar to what can happen with PSI BLAST: if the proteins you choose to include in your growing model diverge too much from your original sequence, it can quickly take you off into strange territory…


Using SMART


Exercise 1: Using Pfam and SMART

Online Scratch Pad
For the following exercises, you may find a scratch pad useful for keeping information from previous stages of a search. If you open up the file scratch-pad.html you’ll find you can keep text data in the outlined box. You cannot save the data, and it’ll vanish if you close the window or refresh it!

Go to the example-sequences.html file and the Protein Domain Searches section, and copy the sequence for >igf4D.
Then go to the SMART web site, paste your sequence, tick at least the signal peptides box, and then run the search.
While that’s running, go to the Pfam site (in a new browser window) and search the same sequence there.
Compare the two result sets. Is there any difference? Should we expect any?
Now go to the NCBI BLAST page, and do a protein-protein BLASTp – this may be a useful way of getting to the same data.
What could you have learned about the function of this gene?

If you are ahead of the rest of the group, check out the results for the much longer >titin sequence.


Using SMART


Exercise 2: Random Sequences Again

We recall that random DNA sequences gave us alignments against real proteins when using BLASTx, and that E-values can give us a good idea whether alignments are biologically meaningful or not.

This becomes even more important when searching for subtler matches – generally shorter sequences with considerable variation allowed at most positions.

Go to the file random-protein-sequences.html and copy the sequence assigned to you. Go to whichever of the Pfam or SMART web sites you preferred, and run the search on your sequence.
Did you find any domain hits?
Were they significant?
Was it possible to tell?
Look at the actual alignments, if you can find out how to, and also see if you can find the model that the domain is based on.

Repeat with a second sequence if you have time.


Functional Motifs in Proteins

You may be more familiar with functional motifs in DNA sequences, e.g. transcription factor binding sites.
Here for example is the (Xenopus) TBox motif: T[CG]A[CG]AC[CG]T

But short motifs are also present in protein sequences, e.g.:

FHA domain interaction motif 1: T..[ILA] (the Forkhead-associated (FHA) domain binds phosphothreonine or phosphoserine containing peptides)

The general problem with motifs is the number of false positives, as they are generally pretty short. For the above example we can easily see that (approximately) every 20th amino acid will be a T, and about 3 in 20 of these (roughly 1 in 7) will have I, L or A in the third position following. So this motif should appear about every 140 amino acids in a random sequence…

This implies a pretty high rate of (probably) false positives – and the almost certain need for confirmatory biology!
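The arithmetic above, and the motif search itself, can be sketched in a few lines of Python. This assumes all 20 amino acids are equally likely, which real proteins are not, and the test sequence is an invented string rather than a real protein. Without rounding 3 in 20 to 1 in 7, the expected spacing comes out at about 133 rather than 140 residues.

```python
# Scan a protein sequence for the FHA interaction motif T..[ILA] and
# estimate how often it turns up by chance under a uniform residue model.
import re

# The lookahead lets overlapping occurrences be reported as well.
FHA_MOTIF = re.compile(r"(?=(T..[ILA]))")


def find_motif(sequence):
    """Return (1-based start position, matched residues) for every occurrence."""
    return [(match.start() + 1, match.group(1))
            for match in FHA_MOTIF.finditer(sequence)]


def expected_spacing(alphabet_size=20, allowed_last=3):
    """Average residues between chance matches, assuming uniform frequencies."""
    probability = (1 / alphabet_size) * (allowed_last / alphabet_size)
    return 1 / probability


toy_sequence = "MTGSLATKKAT"  # invented example, not a real protein
print(find_motif(toy_sequence))
print(f"one chance match every {expected_spacing():.0f} residues on average")
```

Run on a protein of a few hundred residues, a handful of chance matches is therefore to be expected.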


The ELM Server

Eukaryotic Linear Motif

The ELM server (http://elm.eu.org/)
“ELM is a resource for predicting functional sites in eukaryotic proteins. Putative functional sites are identified by patterns (regular expressions). To improve the predictive power, context-based rules and logical filters are applied to reduce the amount of false positives.”

We can judge the problem of interpreting these searches if we use a randomly generated sequence and send it to the ELM server…


Functional Motifs Reported by ELM in a Random Amino Acid Sequence


Secondary Structure Analysis

The weak neighbour-neighbour interactions between amino acids in a protein molecule give rise to a small number of basic structural arrangements. The two main forms are linear helical structures (alpha-helices) and sheets of chains lying side by side (beta-sheets), both stabilised by hydrogen bonds. We may consider that the larger scale structure of the whole protein is built from these smaller scale structures, and as such they may give us some insight into the role of the protein even in the absence of much functional data.

The 3-dimensional protein structures that you see pictures of are often composed of alpha-helices and beta-sheets linked by less well structured sections of the protein.

There are a large number of web pages devoted to analysing proteins for secondary structure, and even some which attempt to aggregate the results of several different methods (at PBIL).

http://www.chemsoc.org/exemplarchem/entries/2004/durham_mcdowall/prot-3.html


Is it Really a Gene?

If you are really getting nowhere with your functional analysis, it may be worth checking whether you have got a gene at all.

There are several circumstances in which this might arise.

If you are using a physical reagent like a cDNA clone, it’s possible that it contains an incomplete mRNA sequence, and you are just looking at a plausible but unreal ORF in the 3’ UTR. Or it could contain an unspliced immature transcript. Or it could even be a contamination from some other, very different species, e.g. bacteria. You may learn a lot by aligning your sequence with the organism’s genome, to check that it is there and that it appears to have exons (if you would expect them).

Or if you found the gene by some sort of mapping/positional analysis, and you are analysing sequences from gene models shown on the genome, check that there is real (e.g. EST) evidence for this gene – it may be purely theoretical, and entirely bogus…


Genomic Analysis

It is possible that analysing the position of your gene on the genome can tell you something about its possible function.

Genes sometimes function in ‘expression cassettes’, where neighbouring genes are either co-expressed, or under closely related (temporal or spatial) regulation. So if nearby genes are well characterised it would be worth considering this as a possibility.

Equally, if there are obvious orthologs of this gene in other species, check out the genomic context there too.

You should also be able to find out if your gene is a member of a gene family, or whether it shares small regions of coding sequence with other genes. Is there a way of doing tBLASTn or tBLASTx against the genome in your preferred browser?


Expression Data

Genes that are co-expressed may well be involved in the same pathways; the more intricate the shared pattern of expression, the greater the likelihood. You may find genes of known function that yours is associated with.

If you found the gene originally in an expression array experiment this may be an easy way in. Alternatively there is a growing amount of expression data out there in databases, although at the moment it’s pretty difficult to mine it systematically. Various efforts are underway to facilitate this (FlyMine, ArrayExpress), though it’s not clear how effective these are yet. It may also be difficult to track ‘your gene’ down in the data sets.

If your gene is from an EST or cDNA sequence, see if the ESTs are clustered and check out which libraries they come from. This may tell you whether your gene is expressed in specific stages/tissues, or whether it is more ubiquitous.


Exercise 3: Genuine Unknowns

The sequence file identification-example-sequences.html contains 12 gene sequences from Xenopus tropicalis which superficially look hard to identify. The full cDNA sequence is given along with the amino acid sequence translated from the presumed ORF.

Start with the first sequence, and accumulate data about it, then work your way on down the list…

Consider doing the following searches:
1. Check BLASTx/p – new sequences are arriving on the database all the time
2. Consider whether PSI BLAST might be useful
3. Check against the genome
4. Look for functional protein domains
5. Look for secondary structure

If you find anything that looks useful keep a note of it.

But bear in mind that, in the real world, you may soon be thinking about going back to the laboratory for further experimental work!


Exercise 3: Results

>u-one Xt6.1-CAAL21151.3
Dpy30, SCOP domains – PSI 2 rounds -> chloroplast enolase? ADP-ribosylation factor-like

>u-two Xt6.1-CABJ8169.5
sipP, RUN, PDZ, PTB domains – PSI 2 rounds -> rap2 interacting protein x

>u-three TEgg047e16
clear orphan, no domains, no results with PSI BLAST, Egg/Ova/Gas EST expression

>u-four IMAGE:7016814
Globin domains, odd organisms, no hit on genome - worm contamination, adult whole body lib.

>u-five IMAGE:5384335
signal peptide, seven transmembrane regions (!)

>u-six TEgg044i21
signal peptide, coiled coils domain - PSI 2 rounds -> yeast-tht1

>u-nine CABE11813
long protein, no domains, no more additions after 2 rounds of PSI BLAST, all_predicted

>u-ten TGas024h08
long protein, no domains, sort-of-name, PSI 2 rounds -> chloroplast RNA processing 1 1e-05...