
Biochimie II

Introduction to Computer Tools Applied to Biology

Daniel Abegg

[email protected]

Assistants: Thomas Falguières, Francine Dreier, Marie-Claude Blatter, Olivier Schaad, Thierry Soldati

Room Baud-Bovy BB03

25 February 2009


1 Introduction to biological databases

Look for specific databases

– Try to find a database (its home server address and the date of the latest update) dealing with:

Topic                                   URL                                                Database        Latest update
Dictyostelium discoideum                http://dictybase.org/                              dictyBase       ???
Drosophila                              http://flybase.org/                                FlyBase         23 Jan 2009
Cotton                                  http://cottondb.org/                               CottonDB        31 Jul 2008
Restriction enzymes                     http://rebase.neb.com                              REBASE          ???
Gene Ontology                           http://www.geneontology.org                        Gene Ontology   19 Mar 2008
Transcriptomic data (microarray data)   http://www.ebi.ac.uk/microarray-as/ae/             EMBL-EBI        May 2008
Human genes and genetic disorders       http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim   OMIM            daily
Lipids                                  http://www.lipidmaps.org/                          Lipid Maps      18 Jun 2008

Searching for sequences (1)....

– Enter "ken and barbie" in the text search box of the UniProt website.
In which species do you find a sequence for this gene?
Does it mean that this gene exists only in this species?

This gene sequence is found in Drosophila melanogaster, but that does not mean it exists only in this species.

– Have a look at the Drosophila melanogaster UniProtKB entry for this gene.
Find the protein, the RNA and the corresponding genomic sequences: list the accession numbers (ACs) for each of these sequence categories.

Sequence      AC         Database
mRNA          AJ012576   EMBL
genomic DNA   AB010261   EMBL
protein       O77459     UniProt

– Could you get information about a person to contact in order to ask for an already cloned cDNA?

Mark Stapleton et al. ([email protected]) published the article "A Drosophila full-length cDNA resource".


– Do the same search at NCBI.
In which species do you find a sequence for this gene?

The "ken and barbie" gene search at NCBI returned the following species: Apis mellifera and Acyrthosiphon pisum.

– Find a protein, an RNA and a corresponding genomic sequence for Drosophila melanogaster 'ken and barbie': list the accession numbers (ACs) for each of these sequence categories.

Sequence      AC          Database
mRNA          NM_079109   NCBI
genomic DNA   NT_033778   NCBI
protein       NP_523833   NCBI

– Look for the AM948965 sequence at NCBI.
What does this accession number (AC) correspond to?
Find the corresponding publication.
Display the sequence in different formats (Fasta, GenBank format...).

This accession number corresponds to the Homo sapiens neanderthalensis complete mitochondrial genome. The publication is "A complete Neandertal mitochondrial genome sequence determined by high-throughput sequencing" (PMID: 18692465).

The beginning of the sequence in two different formats:

FASTA
>gi|195972535|emb|AM948965.1| Homo sapiens neanderthalensis complete mitochondrial genome
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGG

GenBank
LOCUS       AM948965   16565 bp   DNA   circular   PRI   20-AUG-2008
DEFINITION  Homo sapiens neanderthalensis complete mitochondrial genome.
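The FASTA layout shown above (one header line starting with ">", followed by sequence lines) is simple enough to read with a few lines of code. The following is only a minimal sketch: the function names parse_fasta and split_ncbi_header are hypothetical helpers written for this illustration, and the header splitting assumes the old NCBI "gi|number|database|accession| description" layout seen in the record above.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_ncbi_header(header):
    """Split an old-style 'gi|number|db|accession| description' header (assumed layout)."""
    fields = header.split("|")
    return {
        "gi": fields[1],
        "database": fields[2],
        "accession": fields[3],
        "description": fields[4].strip(),
    }


# First line of sequence of the record discussed in the text.
example = """>gi|195972535|emb|AM948965.1| Homo sapiens neanderthalensis complete mitochondrial genome
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGG
"""

header, sequence = parse_fasta(example)[0]
info = split_ncbi_header(header)
print(info["accession"], info["database"], len(sequence))
```

The GenBank flat file carries the same information in labelled fields (LOCUS, DEFINITION, ...), so it needs a different parser.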

Searching for sequences (2)....

– Look for Mammoth, Dodo and Tyrannosaurus protein sequences.

Animal          Number of proteins
Mammoth         95
Dodo            14
Tyrannosaurus   3

– Look for the complete genomic sequence of E. coli strain K12 at NCBI: how many genes are there?
Display the sequence in Fasta format.

There are 4444 genes in E. coli strain K12.

>gi|85674274|dbj|AP009048.1| Escherichia coli str. K12 substr. W3110 DNA, complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC


Genome overview

– Go to Map Viewer at NCBI. How many chimp chromosomes do you see?

The chimp has 25 chromosomes: 24 (including 2 different copies of chromosome 2) + 1.

– Look for data available for human chromosome X: how many genes do you see?

The human chromosome X has 1529 genes. The AC of the 5' telomeric sequence is NT_086925. The repeat in this sequence starts at about nucleotide position 16000.

Genomic databases : follow the links

– Look for the EcoGene database. What type of data, species do you find ?

Database of Escherichia coli Sequence and Function

– Look for the gene gutQ.
Find its chromosomal location.

Left End: 2827835 ----------------- Clockwise ----------------- Right End: 2828800
Minute or Centisome (%) = 60.95
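The two coordinates and the centisome value are related by simple arithmetic, which can be checked as follows. This is a sketch under one stated assumption: the chromosome length used here (4,639,675 bp, the length of an E. coli K-12 MG1655 reference sequence) is not given in the exercise and is supplied only for illustration.

```python
# Coordinates reported by EcoGene for gutQ (from the answer above).
LEFT_END = 2827835
RIGHT_END = 2828800

# Assumption: length in bp of the E. coli K-12 MG1655 reference chromosome.
# This value does not appear in the exercise.
GENOME_LENGTH = 4639675

# Both ends are inclusive, hence the + 1.
gene_length = RIGHT_END - LEFT_END + 1

# Position of the left end expressed as a percentage of the chromosome.
centisome = 100 * LEFT_END / GENOME_LENGTH

print(f"gene length: {gene_length} bp")
print(f"centisome:   {centisome:.2f}")
```

With this assumed genome length, the computed percentage rounds to the 60.95 reported by EcoGene.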

– Find the next gene on the same strand.

The next gene is norV

– Follow the link to UniProtKB/Swiss-Prot.
Find the subcellular location of the protein.

This protein seems to be located in the cytoplasm.

Query UniProtKB

– Find all the nuclear proteins of Dictyostelium discoideum. How many are there?
Do you think you get a complete set? Why?
What are the shortest and longest known protein sequences?

There are 427 nuclear proteins of Dictyostelium, but there could be more, so this is not a complete set.
The longest protein (Midasin) is 5900 amino acids long and the shortest is the DNA-directed RNA polymerases I, II, and III subunit rpabc4, with 46 a.a.


3D structure

– Look for the data on human insulin in PDB (PDB accession number: 1A7F).
What are the 'experimental data' stored in the PDB database?

The experimental data are NMR spectra.

– To obtain the 3D structure, click on ’Quick pdb’ for example.

Fig. 1 – 3D structure of human insulin (pdb entry 1A7F)

Metabolic database (KEGG)

– Look for data on glycolysis in KEGG.
Find the name of the enzyme which catalyzes the conversion of fructose 1,6-P2 into glyceraldehyde 3-P.

The name of the enzyme is ALDO (entry: K01623).

– Compare with the same pathway in sea urchin. Does the enzyme also exist in this species?

The enzyme also exists in sea urchin (entry: 548623).


Polymorphism database (dbSNP)

– Look for information in the dbSNP database on the human blue eye variant rs12913832 (A = ancestral brown allele, G = blue allele).
In which gene do we find this polymorphism?
Find the corresponding publication/citation.

This polymorphism is found in the HERC2 gene.
The corresponding publication is "Blue eye color in humans may be caused by a perfectly associated founder mutation in a regulatory element located within the HERC2 gene inhibiting OCA2 expression." by Eiberg H et al. (PMID: 18172690).

– What is Craig Venter's 'eye color' (look at the Celera genome assembly (= Craig Venter) sequence)?
Follow the link to the Alfred database to look for the population distribution of the 'blue eye allele' (Google map).
In which part of Europe is the blue allele the least prevalent?
Can you propose a hypothesis for this geographical distribution?

Craig Venter has the G allele, which means that he has blue eyes.
On the Google map, Spain is the part of Europe where the blue allele is the least prevalent.
This geographical distribution could be due to a mutation that arose before the migration; as it was not a disadvantage in the northern regions, where there is less sun, it was kept.


2 Protein sequence analysis

Primary sequence analysis

– Find the physico-chemical parameters of the protein sequences (Seq 3 and Seq 4) (use ProtParam, 'MW, pI, Titration curve' and SAPS). Look in particular for the number of amino acids, the MW, the pI, the molar extinction coefficient, the total number of atoms and the chemical formula of each protein.

       Number of a.a.   MW (Da)    pI     ε at 280 nm (M-1 cm-1)   Total atoms   Formula
seq3   988              109770.1   9.16   52745                    15426         C4811H7715N1413O1448S39
seq4   1127             126363.8   6.68   122185                   17823         C5680H8942N1498O1647S56

Values found with ProtParam and compared with 'MW, pI, Titration curve' and SAPS.
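The total number of atoms can be cross-checked against the chemical formula by summing the element counts. The sketch below is only an illustration (atom_counts is a hypothetical helper, not part of ProtParam); for seq3 it uses a nitrogen count of 1413, the value that agrees with the reported total of 15426 atoms.

```python
import re


def atom_counts(formula):
    """Parse a formula such as 'C4811H7715N1413O1448S39' into {element: count}."""
    return {element: int(count)
            for element, count in re.findall(r"([A-Z][a-z]?)(\d+)", formula)}


# Formula and reported total number of atoms for each sequence.
formulas = {
    "seq3": ("C4811H7715N1413O1448S39", 15426),
    "seq4": ("C5680H8942N1498O1647S56", 17823),
}

for name, (formula, reported_total) in formulas.items():
    total = sum(atom_counts(formula).values())
    status = "matches" if total == reported_total else "differs from"
    print(f"{name}: {total} atoms, {status} the reported total")
```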

Topology - Transmembrane prediction

– Can you predict the subcellular location of the protein (use PSORT)?
Can you predict the position of a possible signal peptide (SignalP)?
Can you predict the position of possible transmembrane segment(s)? Compare HMMTOP, TMHMM and TMpred results.
(Pay attention to the required sequence format!)

       PSORT                 SignalP   HMMTOP   TMHMM   TMpred
seq3   nuclear (94.1%)       NO        NO       NO      (Yes)
seq4   cytoplasmic (94.1%)   NO        Yes      Yes     Yes

For sequence 3, TMpred predicted a transmembrane segment, but this is implausible because the protein is nuclear.

Fig. 2 – Possible transmembrane domains in sequence 4 predicted by the TMHMM tool
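To give an intuition for what transmembrane predictors look for, here is a minimal sliding-window hydropathy sketch based on the Kyte-Doolittle scale. This is a deliberately simpler technique than the methods used by HMMTOP, TMHMM or TMpred, and it is not how those tools work internally. The window of 19 residues and the threshold of 1.6 are the commonly quoted rule of thumb for this scale and are assumptions here; the toy sequence is hypothetical.

```python
# Kyte-Doolittle hydropathy value for each amino acid.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Mean hydropathy of every window of the given length (one value per start)."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def candidate_segments(sequence, window=19, threshold=1.6):
    """1-based start positions of windows whose mean hydropathy exceeds the threshold."""
    profile = hydropathy_profile(sequence, window)
    return [i + 1 for i, mean in enumerate(profile) if mean > threshold]


# Toy sequence (hypothetical): a leucine stretch flanked by lysines.
toy = "K" * 10 + "L" * 19 + "K" * 10
profile = hydropathy_profile(toy)
print(profile.index(max(profile)) + 1, candidate_segments(toy))
```

A window above the threshold only flags a hydrophobic stretch; as the disagreement between tools in the table shows, such a hit still has to be weighed against the other evidence.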


Post-translational modification (PTM) prediction

– Take your favorite protein sequence (Seq 3 and Seq 4).
First look at the biological information available for each type of PTM.
Compare the results obtained with different phosphorylation prediction tools (NetPhos and NetPhosK).
Compare the results obtained with different myristoylation prediction tools.
Compare the results obtained with different glycosylation prediction tools (YinOYang, NetNGlyc).
What conclusion can you draw about the presence of these PTMs in your sequence?

       NetPhos (position)           NetPhosK (position)   Myristoylator   NMT
seq3   Ser: 41   Thr: 7    Tyr: 7   PKA: 873              NO              NO
seq4   Ser: 31   Thr: 16   Tyr: 8   PKC: 367              NO              NO

PKC is predicted to phosphorylate sequence 4 at position 367, which TMpred places in a transmembrane domain; in that case the phosphorylation would be impossible. With TMHMM (figure above), position 367 is in the cytosol and is therefore a possible site for PKC.

Fig. 3 – O-GlcNAc sites in sequence 3 predicted by YinOYang.

Fig. 4 – O-GlcNAc sites in sequence 4 predicted by YinOYang.

NetNGlyc predicted N-glycosylation sites for sequence 3, but a nuclear protein cannot carry this modification. Three sites are particularly probable for N-glycosylation in sequence 4: position 317 (76%), position 360 (75%) and position 530 (61%).

The predicted post-translational modifications are consistent with the earlier results.
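N-glycosylation takes place on asparagines within the sequon Asn-Xaa-Ser/Thr, where Xaa is any residue except proline. A first, purely sequence-based scan for such sequons can be sketched as below. This is only a motif search, not the NetNGlyc method, and it gives no probability; find_sequons is a hypothetical helper and the toy sequence is not Seq 3 or Seq 4.

```python
import re

# N-glycosylation sequon: Asn, any residue except Pro, then Ser or Thr.
# The lookahead keeps overlapping sequons from being skipped.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_sequons(sequence):
    """Return the 1-based positions of asparagines that start an N-X-[ST] sequon."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence.upper())]


# Toy sequence (hypothetical): N2 and N9 are in a sequon, N6 is followed by Pro.
toy = "MNGSANPTNKTA"
print(find_sequons(toy))
```

Whether a sequon is actually glycosylated also depends on the location of the protein, which is why the hits in the nuclear sequence 3 are discarded above.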


BLAST

– Take one of the protein sequences (Seq 3 and Seq 4).
Perform a BLAST search:

BLAST @NCBI
BLAST @ExPASy

Note that the first hit may correspond to the same sequence stored in different sequence databases (UniProtKB and RefSeq).
To which protein family does your favorite protein sequence belong?
Look at the data available for the best hit with BLAST @ExPASy and compare the annotation of the corresponding entry with the prediction results you got in the previous exercises.

Sequence 3 is a hypothetical protein with the AC AAK18922 (NCBI). The protein is from C. elegans: 100% for NCBI and 73% for ExPASy, where the protein is uncharacterized. This protein could be involved in post-transcriptional gene expression processes involving mRNA and rRNA (information taken from NCBI). The UniProt entry for this protein (O01864) lists nucleic acid binding or nucleotide binding as molecular function. These functions are consistent with the nuclear location of the protein.

Sequence 4 is also a hypothetical protein; its AC is NP_001023542 (NCBI). It is also from C. elegans: 100% for NCBI and 96% for ExPASy, where the protein is uncharacterized. The conserved regions indicate that the protein could act as a cation transport ATPase and an E1-E2 ATPase (NCBI). The UniProt entry for this protein (Q9N323) proposes hydrolase as molecular function. The protein is also said to be in the membrane and it has a transmembrane domain (UniProt), which agrees with the previous results.


From sequencing to biological information

– Read the following sequencing gel.
What is the function of the corresponding gene product?

Fig. 5 – Sequence: cagaagaggccatcaagcacatcactgtccttctgccatggccc

NCBI BLAST finds that it is the insulin mRNA from Homo sapiens. Insulin is a hormone secreted when the blood glucose concentration is high; it therefore activates glucose uptake by the liver.

BLAST specificity

– Take a random DNA sequence, for example: attatacgtatataattccgataatcgcgctga
Using BLAST @NCBI, try to find it in the human genome.

It is impossible to find this sequence in the human genome. The best hit covers only about half of the random sequence.

– Perform a BLAST search with a fragment of the insulin gene: ctgggcgggg gccctggtgc aggcagcctg
Repeat the exercise using a mutated insulin sequence: ctgggcgggg gccctggtgc aggcagcatg

Insulin is also found by NCBI BLAST with the mutated sequence.
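The two insulin fragments differ by a single base, which is why BLAST still finds insulin. A small sketch to locate the difference (mismatches is a hypothetical helper; it only compares sequences of equal length and does no alignment):

```python
def mismatches(reference, query):
    """List (1-based position, reference base, query base) where two
    equal-length sequences differ."""
    if len(reference) != len(query):
        raise ValueError("sequences must have the same length")
    return [(position, a, b)
            for position, (a, b) in enumerate(zip(reference, query), start=1)
            if a != b]


# Fragments from the exercise; the spaces of the 10-base blocks are removed.
wild_type = "".join("ctgggcgggg gccctggtgc aggcagcctg".split())
mutated = "".join("ctgggcgggg gccctggtgc aggcagcatg".split())

print(mismatches(wild_type, mutated))
```

The single c to a substitution at position 28 of a 30-base query leaves enough identical bases for the insulin gene to remain the best hit.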


– Have a look at the mammoth genome project.
Does the gene 'ken and barbie' exist in mammoth?

The "ken and barbie" gene was not found in the mammoth genome project database.

Summary exercise

– Using the following human protein sequence, do the most complete primary sequence analysis possible, including the subcellular location and PTM prediction (try to keep the order of the analysis as close as possible to the biological interest).

NCBI BLAST found the corresponding protein, which is fibronectin. This protein has a molecular weight of 262606.5 Da, a pI of 5.45 and its formula is C11486H17822N3206O3681S90 (ProtParam). HMMTOP and TMHMM find no transmembrane domains, but TMpred found some, which is not logical for fibronectin because this protein is secreted and present in the extracellular space and extracellular matrix (UniProt: P02751). Signal sequences were found with SignalP.
NetPhos predicted phosphorylation sites at serine 79, threonine 57 and tyrosine 25. NetPhosK found a PKC site at position 29, and there is no myristoylation (Myristoylator and NMT). Fibronectin seems to have some N-glycosylation sites (NetNGlyc), such as at position 430 (76%), 542 (72%) and 1244 (71%), and many O-GlcNAc sites (YinOYang), as shown in the picture below.

Fig. 6 – O-GlcNAc sites for fibronectin (YinOYang)


3 Phylogenetic analysis

Start playing with...

– .... Philophylo

Compare some of the trees obtained depending on the input sequences and/or the number of input sequences.
Which protein (of those provided by this dataset) has been the most 'conserved' during the course of evolution?
Do you have an idea why?

Histone H4 is the most conserved because of its important function in compacting DNA.

– Which protein is the most ’universal’ (= present in most of the species) ?

The most universal protein is the Cytochrome B.

Compare protein sequence by multiple alignment

– Here are the sequences of 5 orthologous genes (i.e. the same gene in 5 different species):
ARP2_A ARP2_B ARP2_C ARP2_D ARP2_E
Do a multiple alignment using one of the alignment tools available on ExPASy. Compare the results obtained by the different tools.

T-COFFEE Output

CLUSTAL FORMAT for T-COFFEE Version_5.05 [http://www.tcoffee.org], SCORE=78, Nseq=5, Len=60

ARP2_B MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE
ARP2_E MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE
ARP2_C MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE
ARP2_D MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE
ARP2_A MES---APIVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE

*:* :* ******** *** *** . **::****::*: : *::::**:***:*

CLUSTALW 2.0.10 multiple sequence alignment

ARP2_C MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE 60
ARP2_D MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE 60
ARP2_B MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE 60
ARP2_E MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE 60
ARP2_A MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE 57

*:* :* ******** *** *** . **::****::*: : *::::**:***:*


Muscle

>ARP2_A
MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE
>ARP2_E
MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE
>ARP2_C
MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE
>ARP2_D
MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE
>ARP2_B
MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE

In conclusion, all three tools introduce a single three-residue gap near the N-terminus of ARP2_A (T-COFFEE after MES, CLUSTALW and Muscle after MESAP).

Manual phylogenetic analysis

– Look at the multiple sequence alignment obtained above.
Fill in the following 'distance matrix' by counting the differences between the sequences (if necessary, redo an alignment with the sequences 2 by 2).

     A    B    C    D    E
A    –    –    –    –    –
B    24   –    –    –    –
C    24   10   –    –    –
D    24   10   0    –    –
E    25   15   13   13   –
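The counting can also be done by program. The sketch below compares the rows of the CLUSTALW alignment column by column; count_differences is a hypothetical helper, and treating a gap facing a residue as one difference is an assumption, so the values involving sequence A may differ slightly from a hand count made with another gap convention or with pairwise re-alignment.

```python
from itertools import combinations

# Rows of the CLUSTALW alignment above ('-' marks a gap).
alignment = {
    "A": "MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE",
    "B": "MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE",
    "C": "MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE",
    "D": "MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE",
    "E": "MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE",
}


def count_differences(row1, row2):
    """Number of alignment columns in which the two rows differ.

    Assumption: a gap facing a residue counts as one difference.
    """
    return sum(1 for a, b in zip(row1, row2) if a != b)


# Print one line per pair, i.e. the lower triangle of the distance matrix.
for x, y in combinations(sorted(alignment), 2):
    print(x, y, count_differences(alignment[x], alignment[y]))
```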

– Knowing that the species are:

Caenorhabditis briggsae
Drosophila melanogaster
Homo sapiens
Mus musculus
Schizosaccharomyces pombe

...which sequence is likely to correspond to which species?

A = Schizosaccharomyces pombe
B = Caenorhabditis briggsae
C = Mus musculus
D = Homo sapiens
E = Drosophila melanogaster


– Try to draw a phylogenetic tree

Fig. 7 – Phylogenetic tree of an orthologous gene sequence of the species Schizosaccharomyces pombe, Caenorhabditis briggsae, Mus musculus, Homo sapiens and Drosophila melanogaster
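One simple way to turn the distance matrix of the previous exercise into a tree is average-linkage clustering (UPGMA): repeatedly join the two closest clusters. The sketch below is an illustration of that technique only; it is not necessarily the method used to draw Fig. 7, it returns the branching order without branch lengths, and upgma is a hypothetical helper.

```python
from itertools import combinations

labels = ["A", "B", "C", "D", "E"]

# Lower triangle of the hand-filled distance matrix.
lower_triangle = {
    ("B", "A"): 24,
    ("C", "A"): 24, ("C", "B"): 10,
    ("D", "A"): 24, ("D", "B"): 10, ("D", "C"): 0,
    ("E", "A"): 25, ("E", "B"): 15, ("E", "C"): 13, ("E", "D"): 13,
}
dist = {frozenset(pair): d for pair, d in lower_triangle.items()}


def upgma(labels, dist):
    """Return the branching order as a nested, Newick-like string."""
    # Each cluster name maps to the list of leaves it contains.
    clusters = {label: [label] for label in labels}

    def average_distance(x, y):
        # Mean of all leaf-to-leaf distances between the two clusters.
        pairs = [(a, b) for a in clusters[x] for b in clusters[y]]
        return sum(dist[frozenset(pair)] for pair in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two clusters with the smallest average distance.
        x, y = min(combinations(clusters, 2),
                   key=lambda pair: average_distance(*pair))
        merged = clusters.pop(x) + clusters.pop(y)
        clusters[f"({x},{y})"] = merged
    return next(iter(clusters))


print(upgma(labels, dist))
```

With this matrix, C and D (distance 0) are joined first, then B, then E, and A is the last sequence to join, which fits the assignment of A to the fission yeast.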

Phylogenetic analysis

– Get the ARP2 protein sequences from human, mouse, fruit fly, worm and fission yeast from UniProtKB in Fasta format. The sequences are those of the previous exercise.

Homo sapiens (Human)

>sp|P61160|ARP2_HUMAN Actin-related protein 2 OS=Homo sapiens GN=ACTR2 PE=1 SV=1
MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDEASELRSMLEVNYPMENGIVRNWDDMKHLWDYTFGPEKLNIDTRNCKILLTEPPMNPTKNREKIVEVMFETYQFSGVYVAIQAVLTLYAQGLLTGVVVDSGDGVTHICPVYEGFSLPHLTRRLDIAGRDITRYLIKLLLLRGYAFNHSADFETVRMIKEKLCYVGYNIEQEQKLALETTVLVESYTLPDGRIIKVGGERFEAPEALFQPHLINVEGVGVAELLFNTIQAADIDTRSEFYKHIVLSGGSTMYPGLPSRLERELKQLYLERVLKGDVEKLSKFKIRIEDPPRRKHMVFLGGAVLADIMKDKDNFWMTRQEYQEKGVRVLEKLGVTVR

Mus musculus (Mouse)

>sp|P61161|ARP2_MOUSE Actin-related protein 2 OS=Mus musculus GN=Actr2 PE=1 SV=1
MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDEASELRSMLEVNYPMENGIVRNWDDMKHLWDYTFGPEKLNIDTRNCKILLTEPPMNPTKNREKIVEVMFETYQFSGVYVAIQAVLTLYAQGLLTGVVVDSGDGVTHICPVYEGFSLPHLTRRLDIAGRDITRYLIKLLLLRGYAFNHSADFETVRMIKEKLCYVGYNIEQEQKLALETTVLVESYTLPDGRIIKVGGERFEAPEALFQPHLINVEGVGVAELLFNTIQAADIDTRSEFYKHIVLSGGSTMYPGLPSRLERELKQLYLERVLKGDVEKLSKFKIRIEDPPRRKHMVFLGGAVLADIMKDKDNFWMTRQEYQEKGVRVLEKLGVTVR

Drosophila melanogaster (Fruit fly)

>sp|P45888|ARP2_DROME Actin-related protein 2 OS=Drosophila melanogaster GN=Arp14D PE=2 SV=2
MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDEASQLRSLLEVSYPMENGVVRNWDDMCHVWDYTFGPKKMDIDPTNTKILLTEPPMNPTKNREKMIEVMFEKYGFDSAYIAIQAVLTLYAQGLISGVVIDSGDGVTHICPVYEEFALPHLTRRLDIAGRDITRYLIKLLLLRGYAFNHSADFETVRIMKEKLCYIGYDIEMEQRLALETTVLVESYTLPDGRVIKVGGERFEAPEALFQPHLINVEGPGIAELAFNTIQAADIDIRPELYKHIVLSGGSTMYPGLPSRLEREIKQLYLERVLKNDTEKLAKFKIRIEDPPRRKDMVFIGGAVLAEVTKDRDGFWMSKQEYQEQGLKVLQKLQKISH


Caenorhabditis briggsae

>sp|Q61JZ2|ARP2_CAEBR Actin-related protein 2 OS=Caenorhabditis briggsae GN=arx-2 PE=3 SV=1
MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEECSQLRQMLDINYPMDNGIVRNWDDMGHVWDHTFGPEKLDIDPKECKLLLTEPPLNPNSNREKMFQVMFEQYGFNSIYVAAVLTLYAQGLLTGVVVDSGDGVTHICPVYEGFALHHLTRRLDIAGRDITKYLIKLLLQRGYNFNHSADFETVRQMKEKLCYIAYDVEQEERLALETTVLSQQYTLPDGRVIRLGGERFEAPEILFQPHLINVEKAGLSELLFGCIQASDIDTRLDFYKHIVLSGGTTMYPGLPSRLEKELKQLYLDRVLHGNTDAFQKFKIRIEAPPSRKHMVFLGGAVLANLMKDRDQDFWVSKKEYEEGGIARCMAKLGIKA

Schizosaccharomyces pombe (Fission yeast)

>sp|Q9UUJ1|ARP2_SCHPO Actin-related protein 2 OS=Schizosaccharomyces pombe GN=arp2 PE=1 SV=1
MESAPIVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDEAEAVRSLLQVKYPMENGIIRDFEEMNQLWDYTFFEKLKIDPRGRKILLTEPPMNPVANREKMCETMFERYGFGGVYVAIQAVLSLYAQGLSSGVVVDSGDGVTHIVPVYESVVLNHLVGRLDVAGRDATRYLISLLLRKGYAFNRTADFETVREMKEKLCYVSYDLELDHKLSEETTVLMRNYTLPDGRVIKVGSERYECPECLFQPHLVGSEQPGLSEFIFDTIQAADVDIRKYLYRAIVLSGGSSMYAGLPSRLEKEIKQLWFERVLHGDPARLPNFKVKIEDAPRRRHAVFIGGAVLADIMAQNDHMWVSKAEWEEYGVRALDKLGPRTT

– Reconstruct phylogenetic trees using the 'One click' analysis methods provided at http://www.phylogeny.fr/.
In case of server problems, use alternative servers for phylogenetic analysis.

Fig. 8 – Phylogenetic tree of the ARP2 protein sequences from human, mouse, fruit fly, worm and fission yeast, built with the http://www.phylogeny.fr/ 'One click' method.


Phylogenetic analysis

– How many distinct trees do you have on this figure ?

Fig. 9 – All the trees are the same. There is only one tree.

– List the positions on the following trees where there is
- a gene duplication event: 2 and 10
- a speciation event: 1, 3, 4, 5, 6, 7, 8, 9, 11 and 12

Fig. 10 – Positions 2 and 10 are duplications and the other ones are speciations


The Tree of Life

– Construct a phylogenetic tree based on dataset 4 using the 'one click' method at http://www.phylogeny.fr/.
To get the correspondence between the 5-letter codes (e.g. ARATH, BACSU) and the species, query the UniProt website or look at the document 'Controlled vocabulary of species' @UniProt.

– Explain the tree. Locate gene duplication and/or speciation events.
Does the resulting tree correspond to the species tree (3 kingdoms: Eukaryota, Archaea, Bacteria)?

Fig. 11 – The big separation marked as duplication is the only duplication. The other events are speciations.

All three kingdoms are present on the tree.

– Try to explain the position of EFTU_ARATH in the tree.

EFTU_ARATH is the chloroplast form of the protein. Chloroplasts originate from bacteria, which explains its position in the tree.


Exercise 6

– If you are still alive, construct a tree with your favorite protein (i.e. insulin)...

Fig. 12 – The phylogenetic trees for the ATP synthase subunit a. The sequences were found on UniProt.

Branchiostoma floridae is a lancelet and Salmo salar is a fish. Sus scrofa is the wild boar. Anopheles gambiae and Aedes aegypti are insects, and Metridium senile is a sea anemone.
All the separations are speciations.


4 Introduction to gene prediction

non-protein coding RNA (ncRNA) gene prediction

– In a C.elegans genomic sequence (cosmid) :

...look for the presence of tRNA gene(s) with tRNAscan-SE. Use the default 'search mode' and the source 'Eukaryotic'. Have a look at the tRNA structure.

Sequence           tRNA Bounds     tRNA   Anti    Intron Bounds   Cove
Name      tRNA #   Begin   End     Type   Codon   Begin   End     Score
--------  ------   -----   ------  ----   -----   -----   ----    ------
Cosmid    1        169     238     His    ATG     0       0       20.56

Fig. 13 – Image of the tRNA of the given sequence

There is one tRNA.


’Ab initio’ protein-coding gene prediction

– Get gene 1, a genomic sequence from C. elegans, and compare the results of gene predictions obtained by different programs (pay attention to the format of the submitted sequence):

HMMgene
Netgene2
WebGene (Genebuilder) (option: "First and last coding exons: disabled")

Draw a diagram describing the different predicted gene structures with the positions (numbering) of the exon and intron boundaries.

Fig. 14 – Exon and intron boundaries found with different tools on the C. elegans gene, and the mRNA (EST) of the same gene (Blastn from NCBI).

– Compare the results obtained by HMMgene if you choose 'human' instead of C. elegans as organism.
Why are they different?

The boundaries of the exons found are the same as with C. elegans, but with 'human' there are three more on the complementary strand: 1290 to 1418, 1461 to 1650 and 1443 to 2522.


Protein-coding gene prediction and the use of sequenced mRNAs(ESTs)

– Do a Blastn search at NCBI with the genomic sequence (gene 1).
Select C. elegans ESTs (mRNAs).
How many different "RNA" sequence(s) can you retrieve?

There are 3 different RNAs, mainly with 4 exons.

– Retrieve the sequence of the mRNA (EST) BJ818152 in Fasta format.
Align this EST with the genomic sequence using SIM4 (an alignment tool specific for cDNA and genomic sequence alignment; SIM4 takes care of the intron/exon boundaries).
Compare the intron/exon boundary numbering with the results obtained by the prediction programs (previous exercise).

>BJ818152 unpublished oligo-capped cDNA library, stage L4 Caenorhabditis elegans cDNA clone yk1685h11 3', mRNA sequence.
TAACGGGACCGAGAACGTTTATCGCTTTCCTCCGACACGTGGAGCAGCAGTCTTCACATTCTTGGCGGTCTTTTGCTGGGTCTTTGGCTGAGAGGCCTTCTTTTCCTTGTTGGCAGCAGCCTTGGCGGCACGGACAGCCTTGTTGGCATCCTTGGCGATCTTAGCGGCTTGTTCACGCTGTTGGCGACGGAAGTCTTCGGTCTGGTTTCTCTTGGCAAGGATAGCATCAAGGGAAAGTCCAGCGACGGCGCGGTTAACAACCTGGACGGACTTCTTGGTCTTCTTTCTGGTGACTTGCTCTTGTCCGTGGGTTCCCTTCTTGTTCTTGATTCTGTAGAGGACAGTCCATCTGATGTCACGTGGGTTACGGCGAAGCTTGGCTCCCTTGAGTGCCTTTCCACTGAGGAAGATTTGGACCTTTCCGTCAGTACGGACAAGTCTCTTTCCGTGTCCTGGGTGGATCTTGTATCCGGAGTAAACGCAGGTTTCGACCTTCATTGTTGATAGGCCCTCGCTTGACGAATCTCAAACTTGGGTAATTAAACCTACAAATAAAAATGAGATAAAGCATACTGCCATTCTACAACCGGAGAATAAGAAAACCGAAAACGAGAAAATTATTCTATTATGACAGATAGAATAAGTTAAAATGGGAAGAGTGCATTTGTCACTGATTTACTTGGTGACTTGGTGGAGAGCGTGGGCAAGGTAAGCGACATTGTTCGATGAA

The EST diagram is on the previous figure.

– Do the same job with another EST (with different exon/intron boundaries if they exist).

The BJ775052 EST was chosen and the boundaries are about the same: from 969 to 1406, from 1452 to 1661 and from 1914 to 2019.
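From the aligned blocks reported above, the exon lengths and the intron positions follow directly. A minimal sketch (describe_gene_structure is a hypothetical helper; coordinates are taken as 1-based and inclusive, which is an assumption about how the alignment tool numbers them):

```python
def describe_gene_structure(exons):
    """Given sorted (start, end) exon coordinates (1-based, inclusive),
    return the exon lengths and the intron intervals between consecutive exons."""
    exon_lengths = [end - start + 1 for start, end in exons]
    introns = [(previous_end + 1, next_start - 1)
               for (_, previous_end), (next_start, _) in zip(exons, exons[1:])]
    return exon_lengths, introns


# Aligned blocks reported above for EST BJ775052.
exons = [(969, 1406), (1452, 1661), (1914, 2019)]
exon_lengths, introns = describe_gene_structure(exons)

print("exon lengths:", exon_lengths)
print("introns:     ", introns)
```

The same function can be applied to the exon lists returned by the prediction programs to compare them with the EST evidence.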


Translation

– Translate the EST BJ818152 sequence using one of the tools provided at ExPASy (pay attention to the EST sequence orientation!).
Select several potential ORFs (open reading frames).
Using BlastP (@ExPASy), compare each potential ORF with already known C. elegans protein sequences.
Identify the correct protein sequence.
Try to find the function of the protein.

This is the selected ORF from the translation of the EST BJ818152 (a 3'5' frame, because the EST is a 3' read):

M K V E T C V Y S G Y K I H P G H G K R L V R T D G K V Q I F L S G K

A L K G A K L R R N P R D I R W T V L Y R I K N K K G T H G Q E Q V T

R K K T K K S V Q V V N R A V A G L S L D A I L A K R N Q T E D F R R

Q Q R E Q A A K I A K D A N K A V R A A K A A A N K E K K A S Q P K

T Q Q K T A K N V K T A A P R V G G K R

Fig. 15 – BlastP result for the ORF above.

The protein found is the 60S ribosomal protein L24 from Caenorhabditis elegans. Its molecular function is structural constituent of ribosome (UniProt entry: O01868).
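Because the EST is a 3' read, the coding strand is the reverse complement of the EST. The sketch below shows the two steps (reverse complement, then translation with the standard genetic code) on a short fragment of the EST; it is an illustration, not the ExPASy tool, and the helper names are hypothetical. The fragment is the stretch of BJ818152 whose reverse complement starts with the ATG of the selected ORF.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons in the order TTT, TTC, TTA, TTG, TCT, ..., GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): amino_acid
               for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    """Complement every base, then reverse the string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate in frame 1; '*' is a stop codon, 'X' an unknown codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


# Fragment of EST BJ818152 (as written in the EST, i.e. on the non-coding strand).
fragment = "GTATCCGGAGTAAACGCAGGTTTCGACCTTCAT"

coding = reverse_complement(fragment)
print(coding)
print(translate(coding))
```

The translated fragment reproduces the first residues of the selected ORF (M K V E T C V Y S G Y).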

– For fun: translate the genomic sequence (gene 1) directly and try to find the correct protein sequence.

It is not possible to obtain the protein by translating the genomic sequence directly, because the coding sequence is interrupted by introns.


If you are still alive...

– ...try to find the correct protein sequence encoded by the following genomic sequence from C. elegans (gene 2)

The protein corresponding to gene 2 is searched for as follows.
First, the exons of gene 2 are located on the complementary strand at the following positions: 789 to 1111, 1410 to 1636 and 1688 to 1845 (HMMgene). A blastn search is done with gene 2 (NCBI) and a sequence with similar exons is chosen.

>OSTR075F6_1 AD-wrmcDNA Caenorhabditis elegans cDNA, mRNA sequence
AATTTGCCCGGGTTCCTTCTTCAACGGATCCTCTTCCTCGTCCTTAACTCTTCTGATCTTCTCCTGTTTTCGATACTTCGCCCGCCGATTCTGAAACCACACTTGAACTCGGGCTTCAGTTAAATCAATTCTCATTGCAATTTCTTCTCGTGTATAAATATCTGGATAATGAGTTTCACAGAATGATCTTTCCAACTCCTTCAGTTGTCCTGATGTGAATGTGGTACGGATTCGGCGTTGTTTTCGACGCTCGGCAGGGTTCAAAGGAGCTCCACCGGTTGAGCAGAGAGCACCAACAAGAGAACTTCTTGGCAGTCCGTTCAAAACATTGCTACTTGTCCTCTGAATCGTATCACTTCCAATTAATTGTGATTTTTGATACAACTGATATTGTAGACCAGTATTAAAAAAAGCTTGTAGTGAATCGTGTGTTGTATTACGGTAGTTTGATGAGGAAGATGATGAAGATGTGGAATTGCCCGCTGAAGAGCTTGAAGTATTGTGAGCAGTTGTCAAGGCACGTCCACTTTGT

After that, this sequence is translated (NCBI Translate tool) and an ORF is chosen. The chosen ORF is in a 3'5' frame because we search on the complementary strand:

M R I D L T E A R V Q V W F Q N R R A K Y R K Q E K I R R V K D E E E

D P L K K E P G Q I

Finally, a BlastP search is done. The protein is the homeobox protein unc-4. It is a transcription factor (UniProt entry P29506), which could explain why few mRNAs were found in the blastn search.

"I certify that in this text every statement that is not the product of my own reflection is attributed to its source, and that every passage copied from another source is additionally placed in quotation marks."

Daniel Abegg
