Dr. R. Sankar, BSE 633 (2020)

BSE 633A

Bioinformatics & Computational Biology

R. Sankararamakrishnan

References

Bioinformatics: Sequence and Genome Analysis. David W. Mount, Cold Spring Harbor Laboratory Press (2001)

Bioinformatics and Functional Genomics. Jonathan Pevsner, Wiley-Blackwell

Developing Bioinformatics Computer Skills. C. Gibas and P. Jambeck, O’Reilly (2001)

Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. R. Durbin, S. Eddy, A. Krogh and G. Mitchison, Cambridge University Press (1998)

Journals: Bioinformatics, BMC Bioinformatics, Nucleic Acids Research, ISMB, J. Comp. Biol., PLoS Computational Biology

Instructors:
Up to MidSem Exam: Dr. R. Sankar
After MidSem Exam: Dr. Hamim Zafar

Course evaluation

Quiz I (February, first week): 5%
Midsem: 30%
Quiz II (April, first week): 5%
Assignment/Exercise: 10%
Presentation: 5%
End-semester exam: 40%
Attendance: 5%

BSE633: Course Contents

Introduction to bioinformatics, biological databases and their growth
Concept of homology and definition of associated terms
Pairwise sequence alignment, dot-matrix plot, dynamic programming algorithm, global (Needleman-Wunsch) and local (Smith-Waterman) alignments, BLAST
Scoring matrices (PAM and BLOSUM families), gap penalty, statistical significance of alignment
Multiple sequence alignment, sum-of-pairs method, CLUSTAL W, genetic algorithm
Pattern finding in protein and DNA sequences, Gibbs sampler, hidden Markov model, profile construction and searching, PSI-BLAST
Introduction to phylogeny, maximum parsimony method, distance method (neighbor-joining), maximum-likelihood method
Gene prediction in prokaryotes and eukaryotes, homology and ab initio methods
Genome analysis and annotation, comparative genomics

PowerPoint presentations for each class and other course materials will be available at:

http://home.iitk.ac.in/~rsankar/courses/

What is Bioinformatics? - Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

What is Computational Biology? - The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.

- NIH Definition http://www.bisti.nih.gov/

Nature (Oct. 2017)

The first protein was sequenced in 1953

Number of protein sequences today (11/Dec/2019)
Source: UniProt Database, www.uniprot.org

Swiss-Prot: 561,568 seqs
TrEMBL: 179,250,561 seqs

Myoglobin and Hemoglobin: First protein structures to be determined

Yearly growth of structures in PDB

http://www.pdb.org

159140 structures in PDB Date: 6/Jan/2020

Genome Sequencing: Important milestones

(m bp = million base pairs; b bp = billion base pairs)

1976: Bacteriophage MS2 (RNA virus); 3569 bp
1976: PhiX174 (DNA virus); 5386 bp
1995: Haemophilus influenzae (bacterium); 1.8 m bp
1995: Methanococcus jannaschii (archaeon); 1.7 m bp
1996: Baker’s yeast; 12.1 m bp
1998: Caenorhabditis elegans; 100 m bp
2000: Arabidopsis thaliana; 119 m bp
2000: Drosophila melanogaster; 165 m bp
2001: Homo sapiens; 3.2 b bp
2002: Mouse; 3.48 b bp
2003: Mosquito; 278 m bp
2003: Japanese pufferfish; 390 m bp
2003: Rice; 374 m bp
2004: Chicken; 1 b bp
2005: Chimpanzee; 3.3 b bp
2010: Western clawed frog; 1.5 b bp
2013: Zebrafish; 1.5 b bp

http://www.yourgenome.org/facts/timeline-organisms-that-have-had-their-genomes-sequenced

Number of genome sequences

http://gregoryzynda.com/ncbi/genome/python/2014/03/31/ncbi-genome.html

https://ark-invest.com/research/genome-sequencing

The genome sequencing market is in its infancy, poised to grow at rates difficult to comprehend. Sequencing is introducing deeper scientific knowledge into medical decision making, eliminating wasteful guess work, and moving us closer to a truly personalized healthcare system.

http://www.internationalgenome.org/

598 sequences from India

https://www.nlm.nih.gov/about/2020CJ.html

https://digitalworldbiology.com/blog/bio-databases-2018-how-do-they-taste

>gi|388480089|ref|YP_492284.1| transporter [Escherichia coli str. K-12 substr. W3110]
MSGLKQELGLAQGIGLLSTSLLGTGVFAVPALAALVAGNNSLWAWPVLIILVFPIAIVFAILGRHYPSAG
GVAHFVGMAFGSRLERVTGWLFLSVIPVGLPAALQIAAGFGQAMFGWHSWQLLLAELGTLALVWYIGTRG
ASSSANLQTVIAGLIVALIVAIWWAGDIKPANIPFPAPGNIELTGLFAALSVMFWCFVGLEAFAHLASEF
KNPERDFPRALMIGLLLAGLVYWGCTVVVLHFDAYGEKMAAAASLPKIVVQLFGVGALWIACVIGYLACF
ASLNIYIQSFARLVWSQAQHNPDHYLARLSSRHIPNNALNAVLGCCVVSTLVIHALEINLDALIIYANGI
FIMIYLLCMLAGCKLLQGRYRLLAVVGGLLCVLLLAMVGWKSLYALIMLAGLWLLLPKRKTPENGITT

A sample record in FASTA format
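
A FASTA record is a header line starting with ">" followed by one or more lines of sequence. The sketch below shows one possible way to read such records in plain Python. The function name parse_fasta and the toy record are illustrative assumptions, not part of the course material; in practice a library reader would usually be preferred.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for FASTA-formatted text lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()     # header text without the '>'
            records[header] = ""
        elif header is not None:
            records[header] += line       # a sequence may span many lines
    return records


# Hypothetical toy record, used only to exercise the parser.
example = """>toy_protein hypothetical example record
MSGLKQELGL
AQGIGLL
"""
records = parse_fasta(example.splitlines())
print(records)
```

The sequence lines are concatenated, so the line width used in the file (70 residues per line in the record above) does not affect the parsed sequence.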

Biological Data

Genomic sequences
Single Nucleotide Polymorphisms (SNPs)
Protein amino acid sequences
Protein 3D structures
Gene Expression
Protein function
Biomolecular interactions and networks
Literature

Emergence of ‘Omes’ – The new ‘era’ in Biology

Transcriptome: the mRNA complement of an entire organism, tissue type, or cell

Metabolome: the totality of metabolites in an organism

Lipidome: the totality of lipids

Glycome: the totality of glycans, carbohydrate structures of an organism, a cell or tissue type

Interactome: the totality of the molecular interactions in an organism

Spliceome: the totality of alternatively spliced protein isoforms

Kinome: the totality of protein kinases in a cell

Foldome: the totality of biological structures as skeletons

Dynome: Adding a 4th Dimension to the Protein Database by Terascale Simulation

Reactome: A knowledge base of biological processes

What is Bioinformatics?

A Proposed Definition and Overview of the Field N.M. Luscombe, D. Greenbaum and M. Gerstein

http://bioinfo.mbb.yale.edu/

What is Bioinformatics?

Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical chemistry) and applying “informatics techniques” (derived from applied maths, computer science and statistics) to understand and organize the information associated with these molecules on a large scale. In short, bioinformatics is a management information system for molecular biology and has many practical applications.

Mark Gerstein, Yale University

Bioinformatics is an interdisciplinary field combining mathematical, statistical, and computer methods to analyze medical, biological, biochemical, and biophysical data.

Crystal Structure of ATP-gated P2X4 receptors

Nature July (2009)

What are we going to learn in this course?

How to compare two sequences?

How to compare many sequences?

How to evaluate an alignment?

What are the limitations?

Phylogenetic analysis

Prediction of genes from a given genomic sequence

Comparative genomics

Species & Speciation

Species: A group of populations whose members
have a similar appearance
can successfully interbreed
are reproductively isolated from other species

Gene flow occurs within a species, which remains genetically distinct and isolated from other species.

Speciation: The formation of two groups of organisms that are reproductively isolated from each other and thus have no gene flow between them.

When there is no gene flow, the two groups accumulate more and more differences over time.

Gene Duplication

•A redundant duplicate of a gene may acquire divergent mutations and eventually emerge as a new gene.

•Gene duplication is one of the means by which a new gene can arise.

•It is one of only a few ways to increase the amount of genetic material.

•It is one of the means of creating new function.

Why should we do sequence alignments?

Useful for discovering functional, structural and evolutionary information in biological sequences

Sequences that are very much alike probably have the same function and, in the case of proteins, the same 3D structure

If two sequences from different organisms are similar, there may have been a common ancestor sequence

The sequences are then defined as homologous

Hemoglobin

Orthologous and paralogous genes

                    _________ Rat_gene_1
           Rat     |
          ________X
         |         |_________ Rat_gene_2
         |
     ---( )
         |          _____________ Mouse_gene_1
         |         |
         |________X
           Mouse   |_____________ Mouse_gene_2

Two genes are said to be orthologous if they diverged after a speciation event. Two genes are said to be paralogous if they diverged after a duplication event.

http://www.icp.ucl.ac.be/~opperd/private/orthol.html
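
The two definitions can be turned into a small procedure: find the last common ancestor of two genes in the gene tree and look at the event that occurred there. The sketch below assumes a reading of the rat/mouse diagram in which the node ( ) is a speciation event and each X is a duplication event. The nested-tuple tree encoding and the function names are illustrative assumptions.

```python
from typing import Optional, Set, Tuple, Union

# A tree is either a leaf (gene name) or (event, left_subtree, right_subtree),
# where event is "speciation" or "duplication". This encoding is hypothetical.
Tree = Union[str, Tuple[str, "Tree", "Tree"]]


def leaves(tree: Tree) -> Set[str]:
    """Collect the gene names below a node."""
    if isinstance(tree, str):
        return {tree}
    _, left, right = tree
    return leaves(left) | leaves(right)


def relationship(tree: Tree, a: str, b: str) -> Optional[str]:
    """Classify genes a and b by the event at their last common ancestor."""
    if isinstance(tree, str):
        return None
    event, left, right = tree
    for child in (left, right):
        child_leaves = leaves(child)
        if a in child_leaves and b in child_leaves:
            return relationship(child, a, b)   # common ancestor lies deeper
    all_leaves = leaves(left) | leaves(right)
    if a in all_leaves and b in all_leaves and a != b:
        return "orthologous" if event == "speciation" else "paralogous"
    return None                                # gene missing, or a == b


# Assumed reading of the diagram: speciation first, then one duplication
# in each of the rat and mouse lineages.
gene_tree: Tree = (
    "speciation",
    ("duplication", "Rat_gene_1", "Rat_gene_2"),
    ("duplication", "Mouse_gene_1", "Mouse_gene_2"),
)

print(relationship(gene_tree, "Rat_gene_1", "Rat_gene_2"))
print(relationship(gene_tree, "Rat_gene_1", "Mouse_gene_2"))
```

Under this reading, the two rat genes are paralogous to each other, while any rat gene is orthologous to any mouse gene.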

Types of Homology

Chymotrypsin Subtilisin

Analogous genes

Similar regions in sequences may not have a common ancestor but may have arisen independently by evolutionary pathways converging on the same function

This is called convergent evolution

Such gene/protein sequences are referred to as analogous

Xenologues

Certain infectious agents, such as retroviruses, or species hybridization can introduce foreign DNA into the genome of an organism.

Once introduced, these sequences become part of the genome and are passed between generations, but they have their origins elsewhere.

Such sequences are called xenologues or xenologous sequences.

Sequence Alignment - Example

Sequence Alignment - Definition

•Procedure of comparing two or more sequences

•Search for a series of individual characters or character patterns that are in the same order in the sequences

•Two sequences are aligned by writing them across a page in two rows

•Identical/similar characters are placed in the same column

•Nonidentical characters are placed in the same column as a mismatch or opposite a gap in the other sequence

•Optimal alignment: mismatches and gaps are placed so as to bring as many identical and similar characters as possible into the same columns
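
The column-by-column view in the definition above can be made concrete by counting, for two already-aligned rows, how many columns are matches, mismatches, or gaps. This is a minimal sketch: the function name alignment_summary, the "-" gap character, and the toy DNA rows are assumptions chosen for illustration.

```python
from typing import Dict


def alignment_summary(row1: str, row2: str, gap: str = "-") -> Dict[str, int]:
    """Count match, mismatch and gap columns in a pairwise alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = gaps = 0
    for x, y in zip(row1, row2):
        if x == gap or y == gap:
            gaps += 1          # a character opposite a gap
        elif x == y:
            matches += 1       # identical characters in the same column
        else:
            mismatches += 1    # nonidentical characters in the same column
    return {"matches": matches, "mismatches": mismatches, "gaps": gaps}


# Toy example rows (hypothetical sequences, already aligned).
summary = alignment_summary("GATTACA", "GA-TCCA")
print(summary)
```

For these toy rows the seven columns split into five matches, one mismatch, and one gap. An optimal alignment is one whose placement of gaps and mismatches gives the best such tally under a chosen scoring scheme.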

PRA isomerase: IGP synthase

>1PII:_|PDBID|CHAIN|SEQUENCE
MQTVLAKIVADKAIWVEARKQQQPLASFQNEVQPSTRHFYDALQGARTAFILECKKASPSKGVIRDDFDPARIAAIYKHYASAISVLTDEKYFQGSFNFLPIVSQIAPQPILCKDFIIDPYQIYLARYYQADACLLMLSVLDDDQYRQLAAVAHSLEMGVLTEVSNEEEQERAIALGAKVVGINNRDLRDLSIDLNRTRELAPKLGHNVTVISESGINTYAQVRELSHFANGFLIGSALMAHDDLHAAVRRVLLGENKVCGLTRGQDAKAAYDAGAIYGGLIFVATSPRCVNVEQAQEVMAAAPLQYVGVFRNHDIADVVDKAKVLSLAAVQLHGNEEQLYIDTLREALPAHVAIWKALSVGETLPAREFQHVDKYVLDNGQGGSGQRFDWSLLNGQSLGNVLLAGGLGADNCVEAAQTGCAGLDFNSAVESQPGIKDARLLASVFQTLRAY
>PRAI sequence
GENKVCGLTRGQDAKAAYDAGAIYGGLIFVATSPRCVNVEQAQEVMAAAPLQYVGVFRNHDIADVVDKAKVLSLAAVQLHGNEEQLYIDTLREALPAHVAIWKALSVGETLPAREFQHVDKYVLDNGQGGSGQRFDWSLLNGQSLGNVLLAGGLGADNCVEAAQTGCAGLDFNSAVESQPGIKDARLLASVFQTLRAY

Global and Local alignment

Global alignment: Align the entire sequence

Sequences that are quite similar and approximately the same length are suitable candidates

Local alignment: Stretches of sequence with the highest density of matches are aligned

One or more islands of matches or subalignments are generated

More suitable for sequences
that are similar along some of their lengths and dissimilar in others
that differ in length
that share a conserved region or domain
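
The difference between the two modes shows up clearly in the dynamic programming recurrence. The sketch below computes only the optimal score, in the style of Needleman-Wunsch (global) and Smith-Waterman (local), the two algorithms named in the course contents. The scoring values (match +1, mismatch -1, linear gap penalty -2), the function name align_score, and the toy sequences are arbitrary assumptions for illustration; real protein alignments would use a substitution matrix such as PAM or BLOSUM.

```python
def align_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                gap: int = -2, local: bool = False) -> int:
    """Optimal pairwise alignment score with a linear gap penalty.

    local=False: global alignment (Needleman-Wunsch style).
    local=True:  local alignment (Smith-Waterman style).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global mode: leading gaps are penalised along the first row/column.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            cell = max(diag, up, left)
            if local:
                cell = max(cell, 0)         # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    # Global: score of aligning both full sequences. Local: best cell anywhere.
    return best if local else score[-1][-1]


# Toy sequences: a short segment embedded in a longer sequence.
global_score = align_score("ACGT", "TTACGTTT")
local_score = align_score("ACGT", "TTACGTTT", local=True)
print(global_score, local_score)
```

With these assumed scores, the global alignment is penalised for the four unavoidable gap columns, while the local alignment reports only the embedded ACGT island. This mirrors the slide's point that local alignment suits sequences that differ in length or share only a conserved region.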

Exercise 1

Get the UniProt (www.uniprot.org) Accession ID for the protein whose PDB ID is 1BL8. Go to the corresponding UniProt entry. Find out the databases which are cross-linked. What are the related databases from which you can extract information about this protein?