Bioinformatics Basics
Cyrus. Courtesy of LO Leung Yau’s original presentation

Page 1: Bioinformatics Basics

Bioinformatics Basics

Cyrus
Courtesy of LO Leung Yau’s original presentation

Page 2: Bioinformatics Basics

Outline

Biological Background
  Cell
  Protein
  DNA & RNA
  Central Dogma
  Gene Expression

Bioinformatics
  Sequence Analysis
  Phylogenetic Trees
  Data Mining

Page 3: Bioinformatics Basics

Biological Background – Cell

Basic unit of organisms
  Prokaryotic
  Eukaryotic
A bag of chemicals
Metabolism controlled by various enzymes
Correct working needs suitable amounts of various proteins

Picture taken from http://en.wikipedia.org/wiki/Cell_(biology)

Page 4: Bioinformatics Basics

Biological Background – Protein

Polymer of 20 types of amino acids
Folds into 3D structure
Shape determines the function
Many types
  Transcription factors
  Enzymes
  Structural proteins
  …

Picture taken from http://en.wikipedia.org/wiki/Protein and http://en.wikipedia.org/wiki/Amino_acid

Page 5: Bioinformatics Basics

Biological Background – DNA & RNA

DNA
  Double stranded
  Adenine, Cytosine, Guanine, Thymine
  A-T, G-C
  Those parts coding for proteins are called genes
RNA
  Single stranded
  Adenine, Cytosine, Guanine, Uracil
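The pairing rules above (A-T, G-C in DNA; U in place of T in RNA) can be sketched in a few lines of Python. This is only an illustration; the function names are made up for this example and are not part of the original slides.

```python
# Watson-Crick pairing for DNA: A pairs with T, G pairs with C.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read in its own 5'->3' direction."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(dna.upper()))


def transcribe(coding_strand: str) -> str:
    """RNA copy of the coding strand: same letters, with U in place of T."""
    return coding_strand.upper().replace("T", "U")


print(reverse_complement("ATGC"))  # GCAT
print(transcribe("ATGC"))          # AUGC
```

The sketch assumes the input contains only the letters A, C, G and T; ambiguity codes such as N would raise a KeyError in reverse_complement.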

Picture taken from http://en.wikipedia.org/wiki/Gene

Page 6: Bioinformatics Basics

Biological Background – Genes

Genes – protein coding regions

3 nucleotides code for one amino acid

There are also start and stop codons
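As a rough sketch of how codons map to amino acids, the snippet below builds the standard genetic code table and translates from the first start codon (ATG) to the first stop codon. The function name translate and the example sequence are illustrative assumptions, not taken from the slides.

```python
from itertools import product

# Standard genetic code, listed in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str) -> str:
    """Translate from the first ATG up to (not including) the first stop codon."""
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X = unknown codon
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ccATGGCCTAAgg"))  # MA
```

Here ATG codes for M (methionine, the start), GCC for A (alanine), and TAA is a stop codon, so the result is "MA".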

Page 7: Bioinformatics Basics

Biological Background – in a nutshell

Abstractions
  Functional Units: Proteins
  Templates: RNAs
  Blueprints: DNAs

Not only the information (data), but also the control signals about what and how much data is to be sent

Proteins (TFs) also help

Page 8: Bioinformatics Basics

Biological Background – Sequences

Abstractions

Sequences

acatggccgatcaggctgtttttgtgtgcctgtttttctattttacgtaaatcaccctgaacatgtTTGCATCAacctactggtgatgcacctttgatcaatacattttagacaaacgtggtttttgagtccaaagatcagggctgggttgacctgaatactggatacagggcatataaaacaggggcaaggcacagactc

Annotations

FT   intron          <1..28
FT                   /gene="CREB"
FT                   /number=3
FT                   /experiment="experimental evidence…
FT                   recorded"
FT   exon            29..174
FT                   /gene="CREB"
FT                   /number=4
FT                   /experiment="experimental evidence…
FT                   recorded"
FT   intron          175..>189
FT                   /gene="CREB"
FT                   /number=4

Visualizations

Page 9: Bioinformatics Basics

Biological Background – DNA → RNA → Protein

Picture taken from http://en.wikipedia.org/wiki/Gene

Page 10: Bioinformatics Basics

Biological Background – DNA → RNA → Protein

A Transcriptional Regulatory Network is the complex interaction between genes, transcription factors (TFs) and transcription factor binding sites (TFBSs).

Other functions
Transcription Factors
Binding sites
Genes
Promoter regions

Page 11: Bioinformatics Basics

Complex Interactions between Genes, TFs and TFBSs

Page 13: Bioinformatics Basics

Gene Expression

Microarray Data
  High throughput
  Measures RNA level
  Relies on A-T, G-C pairing
  Can monitor expression of many genes

Picture taken from http://en.wikipedia.org/wiki/DNA_microarray_experiment

Page 14: Bioinformatics Basics

Gene Expression Microarray Data

Picture taken from http://en.wikipedia.org/wiki/DNA_microarray

Genes

Time points/Conditions

Colors: Expression (RNA) Levels

Page 15: Bioinformatics Basics

Bioinformatics – Sequence Analysis

Alignments
  A way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences

http://en.wikipedia.org/wiki/Sequence_alignment

Page 16: Bioinformatics Basics

Bioinformatics – Sequence Analysis

Pair-wise alignments
  Method: dynamic programming!
  No penalty for the consecutive ‘-’s before and after the sequence to be aligned
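A minimal sketch of such a dynamic-programming alignment, in which a short query is placed inside a longer reference and the leading and trailing '-'s on the query row cost nothing. The scoring values (match +1, mismatch -1, gap -2) and the name fit_align are assumptions chosen for illustration, not the scheme used in the lectures.

```python
def fit_align(query, ref, match=1, mismatch=-1, gap=-2):
    """Align all of `query` inside `ref`; end gaps around the query are free."""
    m, n = len(query), len(ref)

    def sub(i, j):
        # Score for pairing query[i-1] with ref[j-1].
        return match if query[i - 1] == ref[j - 1] else mismatch

    # score[i][j]: best score for query[:i] ending at position j of ref.
    # Row 0 stays at 0, so skipping a prefix of ref is not penalised.
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),  # (mis)match
                              score[i - 1][j] + gap,   # gap in ref
                              score[i][j - 1] + gap)   # gap inside query

    # Free trailing gaps: take the best cell anywhere in the last row.
    end = max(range(n + 1), key=lambda j: score[m][j])

    # Trace back from (m, end) until the whole query has been consumed.
    q_chars, r_chars = [], []
    i, j = m, end
    while i > 0:
        if j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            q_chars.append(query[i - 1]); r_chars.append(ref[j - 1])
            i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            q_chars.append(query[i - 1]); r_chars.append("-")
            i -= 1
        else:
            q_chars.append("-"); r_chars.append(ref[j - 1])
            j -= 1
    start = j

    # Pad the query row with the unpenalised end gaps.
    q_line = "-" * start + "".join(reversed(q_chars)) + "-" * (n - end)
    r_line = ref[:start] + "".join(reversed(r_chars)) + ref[end:]
    return score[m][end], q_line, r_line


best, q_line, r_line = fit_align("GATTACA", "CCGATTACATT")
print(best)    # 7
print(q_line)  # --GATTACA--
print(r_line)  # CCGATTACATT
```

Keeping the first row at zero and reading the answer from the best cell of the last row is what makes the end gaps free; gaps inside the aligned region are still penalised.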

\\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC3220 Lectures

Page 17: Bioinformatics Basics

Bioinformatics – Sequence Analysis

Multiple (global) sequence alignment
  Also dynamic programming (but can’t scale up!)
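To see why it cannot scale: exact dynamic programming over N sequences of length L fills a table with (L+1)^N cells. The short calculation below is only a back-of-the-envelope illustration; the helper name dp_cells and the length of 100 are arbitrary choices for this example.

```python
def dp_cells(seq_len: int, num_seqs: int) -> int:
    """Number of cells in the exact DP table for num_seqs sequences."""
    return (seq_len + 1) ** num_seqs


for n in (2, 3, 5, 10):
    print(f"{n} sequences of length 100: {dp_cells(100, n):,} cells")
```

Two sequences of length 100 need about ten thousand cells, while ten such sequences already need more than 10^20, which is why heuristic methods are used for multiple alignment in practice.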

Page 18: Bioinformatics Basics

Bioinformatics – Sequence Analysis

Multiple local sequence alignment
  i.e. Motif (pattern) discovery

>seq1
acatggccgatcagctggtttttgtgtgcctgtttctgaatc
>seq2
ttctattttacgtaaatcagcttgaacatgtacctactggtg
>seq3
atgcacctttgatcaataccagctagacaaacgtgtgttg
>seq4
agtccaaagatcagggctggctgaatactggatcagct
>seq5
cagctacagggcatataaaggggcaaggcacagactc

Such overrepresented patterns are often important components (e.g. TFBSs if the sequences are promoters of similar genes).

TFBSs are the controlling keyholes in gene regulation!
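One very simple way to look for an overrepresented pattern is to count, for every substring of length k (k-mer), how many sequences contain it. The sketch below does this for the five sequences on this slide. It uses exact matching only, which is a simplifying assumption: real motif discovery tools tolerate mismatches. The function name shared_kmers is hypothetical.

```python
from collections import Counter


def shared_kmers(seqs, k, min_support=None):
    """Return k-mers that occur in at least `min_support` of the sequences."""
    if min_support is None:
        min_support = len(seqs)  # default: present in every sequence
    support = Counter()
    for seq in seqs:
        seq = seq.lower()
        # A set, so each k-mer is counted at most once per sequence.
        support.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return {kmer: count for kmer, count in support.items()
            if count >= min_support}


promoters = [
    "acatggccgatcagctggtttttgtgtgcctgtttctgaatc",
    "ttctattttacgtaaatcagcttgaacatgtacctactggtg",
    "atgcacctttgatcaataccagctagacaaacgtgtgttg",
    "agtccaaagatcagggctggctgaatactggatcagct",
    "cagctacagggcatataaaggggcaaggcacagactc",
]
print(sorted(shared_kmers(promoters, k=5)))
```

The 5-mer "cagct" appears in all five sequences, so it is among the reported patterns.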

Page 19: Bioinformatics Basics

DNA motifs

Similar DNA fragments across individuals and/or species
TFBS Motifs: DNA fragments similar to “TATAA” are common, because they are needed for genes to function
Expensive and time-consuming to try a large set of candidates in biological experiments
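"Similar to TATAA" can be made concrete by allowing a small number of mismatches (Hamming distance). The sketch below scans a made-up sequence for fragments within one mismatch of the motif; the sequence, the threshold and the function names are illustrative assumptions.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_similar(seq: str, motif: str, max_mismatches: int = 1):
    """Return (position, fragment, mismatches) for every close match."""
    seq, motif = seq.upper(), motif.upper()
    k = len(motif)
    hits = []
    for i in range(len(seq) - k + 1):
        fragment = seq[i:i + k]
        distance = hamming(fragment, motif)
        if distance <= max_mismatches:
            hits.append((i, fragment, distance))
    return hits


print(find_similar("GGTATAAGGCTATATGG", "TATAA"))
# [(2, 'TATAA', 0), (10, 'TATAT', 1)]
```

Screening candidates this way on a computer is cheap, which is the motivation for narrowing down the set before any biological experiment.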

[Diagram: a Transcription Factor (TF) binds the TFBS (controlling, e.g. TATAA) on the DNA next to the Gene (functioning); Transcription produces RNA, and Translation produces Protein]

Page 20: Bioinformatics Basics

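Motif discovery

TFBS Motif Discovery
  Sequences with similar controlled functions, e.g. cancer gene activities
  Find the motif (e.g. CGATTGA) for which f is maximized

SNP (single nucleotide polymorphism) Motif Discovery
  DNA from different people: Normal vs. Disease!
  Find the pattern for which f, the ability to distinguish Normal from Disease, is maximized

[Figure: DNA from different people compared position by position (A, C, G, T), labelled Normal or Disease]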

Page 21: Bioinformatics Basics

Bioinformatics – Data mining

Classification
  To predict!
  Pre-processing: tidy up your materials!
  Feature selection: the key points to go over
  Classifier: the thinking style/manner of how to combine the key points and get some answer
  Training: your practice of your thinking manner with answers known
  Validation: mock quiz to evaluate what you’ve learnt from the training
  Testing: your examination!
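The steps above can be sketched end to end on synthetic data. Everything in this example is an assumption made for illustration: the data are random numbers (features 0 and 1 differ between the two classes, features 2 and 3 are noise), the classifier is a simple nearest-centroid rule, and pre-processing is skipped because the synthetic data are already clean.

```python
import random


def make_data(n, rng):
    """Synthetic samples: features 0-1 are informative, 2-3 are noise."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        shift = 4.0 * label
        x = [rng.gauss(shift, 1.0), rng.gauss(shift, 1.0),
             rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        data.append((x, label))
    return data


def class_means(data, label):
    rows = [x for x, y in data if y == label]
    return [sum(col) / len(col) for col in zip(*rows)]


def select_features(train, k):
    """Feature selection: keep the k features whose class means differ most."""
    m0, m1 = class_means(train, 0), class_means(train, 1)
    ranked = sorted(range(len(m0)), key=lambda f: abs(m0[f] - m1[f]),
                    reverse=True)
    return ranked[:k]


def fit(train, features):
    """Training: store one centroid (mean vector) per class."""
    return {label: [class_means(train, label)[f] for f in features]
            for label in (0, 1)}


def predict(model, features, x):
    """Classifier: assign the class whose centroid is nearest."""
    def sq_dist(centroid):
        return sum((x[f] - c) ** 2 for f, c in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))


def accuracy(model, features, data):
    return sum(predict(model, features, x) == y for x, y in data) / len(data)


rng = random.Random(0)
data = make_data(200, rng)
train, valid, test = data[:120], data[120:160], data[160:]


def validate(k):
    """Validation: score a candidate number of features on held-out data."""
    feats = select_features(train, k)
    return accuracy(fit(train, feats), feats, valid)


best_k = max(range(1, 5), key=validate)
features = select_features(train, best_k)
model = fit(train, features)
test_accuracy = accuracy(model, features, test)  # Testing: used only once
print(best_k, features, test_accuracy)
```

The test set is touched only at the very end; choosing best_k on the validation set rather than on the test set is what keeps the final score an honest estimate.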

\\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC5180 Data Mining Notes\c3class1.pdf

Underfitting & Overfitting

Page 22: Bioinformatics Basics

TRANSFAC Project

TF: Transcription Factors, important regulators
TFBS: Transcription Factor Binding Sites, major regulatory elements
TRANSFAC: The most representative DB for TFs and TFBSs

1. Modeling: statistical models, representations, Markov chains; Discovery: stochastic searching, indexing (suffix trees)
2. Relationship: TF-TFBS; TFBS-Gene… (understanding, prediction); Mining: text mining, approximate matching
3. Annotations: accurate wet-lab candidates (reduced labor and costs); Computation: large scale data processing; parallel computing

Representative Publications

[1] Gang Li, Tak-Ming Chan, Kwong-Sak Leung and Kin-Hong Lee. A Cluster Refinement Algorithm for Motif Discovery. IEEE/ACM Transactions on Computational Biology and Bioinformatics (accepted).

[2] Tak-Ming Chan, Kwong-Sak Leung and Kin-Hong Lee. TFBS identification based on genetic algorithm with combined representations and adaptive post-processing. Bioinformatics, 2008, 24(3), pp. 341-349.

Page 23: Bioinformatics Basics

Bioinformatics – Data mining

Evaluation (scores!)
  Confusion Matrix
  Binary Classification

Performance Evaluation Metrics
  Accuracy
  Sensitivity/Recall/TP Rate
  Specificity/TN Rate
  Precision/PPV
  …

\\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC5180 Data Mining Notes\c3class3.pdf

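$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Sensitivity (Recall, TP Rate)} = \frac{TP}{TP + FN}$$

$$\text{Specificity (TN Rate)} = \frac{TN}{TN + FP}$$

$$\text{Precision (PPV)} = \frac{TP}{TP + FP}$$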
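A small worked example of these metrics, computed from the four confusion-matrix counts. The label lists are invented for illustration (1 = positive, 0 = negative), and the function names are not from the lecture notes.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Standard metrics; assumes every denominator is non-zero."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tp, tn, fp, fn = confusion_counts(actual, predicted)
print(tp, tn, fp, fn)  # 3 5 1 1
print(metrics(tp, tn, fp, fn))
```

With TP = 3, TN = 5, FP = 1 and FN = 1, accuracy is 0.8, sensitivity and precision are both 0.75, and specificity is 5/6.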

Page 24: Bioinformatics Basics

Bioinformatics – Data mining

Evaluation
  ROC (Receiver Operating Characteristic)
  Trade-off between positive hits (TP) and false alarms (FP)
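The trade-off can be traced by lowering the decision threshold one sample at a time and recording the false positive rate and true positive rate at each step. The sketch below assumes that all scores are distinct and that both classes are present; the scores and labels are made up for illustration.

```python
def roc_points(labels, scores):
    """Return (FP rate, TP rate) points as the threshold is lowered."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), reverse=True)  # highest score first
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1  # one more positive hit
        else:
            fp += 1  # one more false alarm
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
print(curve)
print(round(auc(curve), 4))  # 0.8889
```

An area of 1.0 would mean every positive is ranked above every negative; 0.5 corresponds to random ranking. Here 8 of the 9 positive-negative pairs are ranked correctly, giving 8/9.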

Page 25: Bioinformatics Basics

Not The End

Your corresponding tutor will have more project-specific stuff to tell you

Thanks

Q & A