46
CLOUD BIOINFORMATICS BY.A.ARPUTHA SELVARAJ

Cloud bioinformatics 2

Embed Size (px)

DESCRIPTION

GENE

Citation preview

Page 1: Cloud bioinformatics 2

CLOUD BIOINFORMATICS

BY.A.ARPUTHA SELVARAJ

Page 2: Cloud bioinformatics 2

Bioinformatics is…

The development of computational methods for studying the structure, function, and evolution of genes, proteins, and whole genomes;

The development of methods for the management and analysis of biological information arising from genomics and high-throughput biological experiments.

Page 3: Cloud bioinformatics 2

3

Why is there Bioinformatics?

Lots of new sequences being added- Automated sequencers- Genome Projects- Metagenomics- RNA sequencing, microarray studies, proteomics,…

Patterns in datasets that can be analyzed using computers

Huge datasets

Page 4: Cloud bioinformatics 2

4

• Gramicidine S (Consden et al., 1947), partial insulin sequence (Sanger and Tuppy, 1951)

• 1961: tRNA fragments

• Francis Crick, Sydney Brenner, and colleagues propose the existence of transfer RNA that uses a three base code and mediates in the synthesis of proteins (Crick et al., 1961) General nature of genetic code for proteins. Nature 192: 1227-1232. In Microbiology: A Centenary Perspective, edited by Wolfgang K. Joklik, ASM Press. 1999, p.384

• First codon assignment UUU/phe (Nirenberg and Matthaei, 1961)

Need for informatics in biology: origins

Page 5: Cloud bioinformatics 2

5

• The key to the whole field of nucleic acid-based identification of microorganisms…

…the introduction molecular systematics using proteins and nucleic acids by the American Nobel laureate Linus Pauling.

Zuckerkandl, E., and L. Pauling. "Molecules as Documents of Evolutionary History." 1965. Journal of Theoretical Biology 8:357-366

• Another landmark: Nucleic acid sequencing (Sanger and Coulson, 1975)

Need for informatics in biology: origins

Page 6: Cloud bioinformatics 2

6

Need for informatics in biology: origins

• First genomes sequenced: – 3.5 kb RNA bacteriophage MS2 (Fiers et al.,

1976)– 5.4 kb bacteriophage X174 (Sanger et al.,

1977)– 1.83 Mb First complete genome sequence of a free-living

organism: Haemophilus influenzae KW20 (Fleischmann et al., 1995)

– First multicellular organism to be sequenced: C. elegans (C. elegans sequencing consortium, 1998)

• Early databases: Dayhoff, 1972; Erdmann, 1978

• Early programs: restriction enzyme sites, promoters, etc… circa 1978.

• 1978 – 1993: Nucleic Acids Research published supplemental information

Page 7: Cloud bioinformatics 2

7(from the National Centre for Biotechnology Information)

Genbank and associated resources doubles faster than Moore’s Law! (< every 18 months)

http://en.wikipedia.org/wiki/Moore’s_law

Page 8: Cloud bioinformatics 2

8

Today: So many genomes…

As of mid-August 2010, according to the GOLD GenomesOnline database….

• Eukaryotic genome projects are in progress?(Genome and ESTs) 1548 (517 - 5 years ago)

• Prokaryote genome projects are in progress?5006 (740 - 5 years ago)

• Metagenome projects are in progress?133 (Zero - 5 years ago)

TOTAL 6687 projects (As of Sept 2011: >10,000)

Page 9: Cloud bioinformatics 2
Page 10: Cloud bioinformatics 2
Page 11: Cloud bioinformatics 2
Page 12: Cloud bioinformatics 2
Page 13: Cloud bioinformatics 2
Page 14: Cloud bioinformatics 2

14

The Human Genome

The genome sequence is complete - almost!– approximately 3.5 billion base pairs.

Page 15: Cloud bioinformatics 2

15

Work ongoing to locate all genes and regulatory regions and describe their functions… …bioinformatics plays a critical role

Page 16: Cloud bioinformatics 2

16

Identifying single nucleotide polymorphisms (SNPs) and other changes between individuals

Page 17: Cloud bioinformatics 2

17

Bioinformatics helps with……. Sequence Similarity Searching/Comparison

What is similar to my sequence? Searching gets harder as the databases

get bigger - and quality changes Tools: BLAST and FASTA = early time

saving heuristics (approximate methods) Need better methods for SNP analysis! Statistics + informed judgment of the

biologist

Page 18: Cloud bioinformatics 2

18

Bioinformatics helps with…….Structure-Function Relationships

Can we predict the function of protein molecules from their sequence?sequence > structure > function

Prediction of some simple 3-D structures possible (a-helix, b-sheet, membrane spanning, etc.)

Page 19: Cloud bioinformatics 2

19

Can we define evolutionary relationships between organisms by comparing DNA sequences?- Lots of methods and software,

what is the best analysis approach?

Bioinformatics helps with…….Phylogenetics

Page 20: Cloud bioinformatics 2

WHAT IS NEXT GENERATION SEQUENCING (NGS)?

Page 21: Cloud bioinformatics 2

Sanger (“dideoxy sequencing or chain termination”) Sequencing

Single stranded DNA from sample* extended by polymerase from primer then randomly terminated by dideoxy nucleotide (ddNTP)

Variable length DNA fragments radiolabelled or fluorescently detected ddNTP

*sample derived from amplified cDNA, genomic clones or whole genome shotgun

Page 22: Cloud bioinformatics 2

Sanger Pro’s & Con’s

• Advantages– Relatively accurate– Relatively long (500 –

1500) bp reads

• Disadvantage– Relatively costly in terms

of reagents and relatively low throughput

Page 23: Cloud bioinformatics 2

Next Generation Sequencing (NGS)

Sequence Assembly

on HPC

Roche 454

Life Tech. Ion Torrent

Illumina HiSeq

Life Tech SOLiD Oxford Nanopore “GridION”

Polonator

HeliScope

Pacific Biosciences

SMRT Cell

Page 24: Cloud bioinformatics 2

(General) NGS Pro’s & Con’s

• Advantages– Very high throughput– Very cheap data

production

• Disadvantages– Relatively short reads– Relatively higher error

rates– Bioinformatics of

assembly is much more challenging

Page 25: Cloud bioinformatics 2

General Workflow

1. Template preparation2. Sequencing & imaging3. Genome alignment/assembly

Page 26: Cloud bioinformatics 2

COPING WITH THE BIOINFORMATICS CHALLENGE

Page 27: Cloud bioinformatics 2

Challenge

Assembling “next generation sequence” (NGS) data requires a great deal of computing power and gigabytes memory

Software often can execute in parallel on all available computer processing unit (CPU) cores.

Many functional annotation processes (e.g. database searching, gene expression statistical analyses) also demand a lot of computing power

Page 28: Cloud bioinformatics 2

“High Performance Computing” and “Cloud

Computing”

Computer Nodes

Network Storage

Your local workstation/

laptop

Page 29: Cloud bioinformatics 2

What is Cloud Computing? Pooled resources: shared with many

users (remotely accessed) Virtualization: high utilization of

hardware resources (no idling) Elasticity: dynamic scaling without

capital expenditure and time delay Automation: build, deploy, configure,

provision, and move without manual intervention

Metered billing: “pay-as-you-go, only for what you use

Cloud Computing

Page 30: Cloud bioinformatics 2

Cloud Bioinformatics Module

Raw Data/Results/

Snapshots

Task-Specialized

Server

Input Job Message Queue

Output Job Message Queue

Job StatusNotification

CustomizedMachine

Image

Start-up(w/parameters)

Page 31: Cloud bioinformatics 2

A More Complete Picture…

Raw Data + Results

WebPortal

ProjectRelationalDatabase

DatabaseLoader

Page 32: Cloud bioinformatics 2

Case Study in Bioinformatics on the Cloud

Used Amazon Web Serviceshttp://aws.amazon.com

Assembled ~99 raw NGS transcriptome sequence datasets from 83 species, on 16 Amazon EC2 instances with 8 CPU cores, 68 GB of RAM, ~200 hours of computer time, total run in less than one working day.

Each single machine of the required size would likely have cost at least ~$10,000 (and time) to purchase, and incur significant operating costs overhead (machine room space, power supplies, networking, air conditioning, staff salaries, etc.)

The above run could be started up in a few minutes and cost ~ $500 to complete. Once done, no machines left idling and unused…

Page 33: Cloud bioinformatics 2

Software for (NGS) Bioinformatics

Bundled with sequencing machines:e.g. Newbler assembler with Roche 454

3rd party commercial:DNA Star (www.dnastar.com)Geneious (http://www.geneious.com/)GeneWiz (http://www.genewiz.com)And others…

Open Source:Lots (selected examples to be covered in this

workshop)

Page 34: Cloud bioinformatics 2

What do I need to run bioinformatics software locally?

Some common bioinformatics software is platform independent, hence will run equally under Windows and UNIX (Linux, OSX)

Most other software targets Unix systems. If you are running Microsoft Windows and want to run such software locally, the easiest way to do this(?) is to install some version of Linux (suggest “Ubuntu”) as a dual boot or (less intrusively) as a guest operating system in a virtual machine, e.g.http://www.vmware.com/products/player/

Page 35: Cloud bioinformatics 2

But, what are *we* going to use here?

Page 36: Cloud bioinformatics 2

WestGrid @ SFU / IRMACSWestGrid is a consortium member of “Computer

Canada”https://computecanada.org/

“bugaboo” cluster: 4328 cores total: 1280 cores, 8 cores/node, 16 GB/node, x86_64, IB. Plus 3048 cores, 12 cores/node, 24GB/node, x86_64, IB. capability cluster, 40 Core Years

Access to other Westgrid resources through LAN and WAN

More details from Brian Corrie tomorrow…

Page 37: Cloud bioinformatics 2

Galaxy Genomics Workbench

http://galaxy.psu.edu/(also http://main.g2.bx.psu.edu/)

Page 38: Cloud bioinformatics 2

THE ROADMAP

Page 39: Cloud bioinformatics 2

What is Bioinformatics?

Road Map

AnnotationSequences(Formats)

Visualization of Sequence & Annotation

Search &Alignments

NGS

SequenceDatabases

Sequence Assembly

Page 40: Cloud bioinformatics 2

Specific Applications

Sequence Assembly of Transcriptomes Sequence Assembly of Whole Genomes Annotation of de novo Assembled Sequences Identification and Analysis of Sequence Variation Comparative Genomic Analysis and Visualization Meta-Analysis of Annotated Sequence Data

Page 41: Cloud bioinformatics 2

Survey: Workshop Expectations I How to find significance in the huge amount of

data that Next Gen sequencing, but also microarrays etc. generate.

A basic understanding of how to analyse next generation sequencing data.

Learn some hands-on computer experience learning to use software for analysing sequence

data; what can be done and how to do it. genome assembly + meta-analysis

Page 42: Cloud bioinformatics 2

Survey: Workshop Expectations II The basics of alignment and SNP calling with next-gen

sequencing, and what kind of programs are out there to do these tasks and then analyze the large datasets (I've been trying to figure this out on my own through reading the literature and it's quite time consuming so any info provided through the workshop would be very helpful - thanks)

The main workflow for processing sequence data from the beginning to the more specific paths of analyses. Also the concepts, significance of the adjustable parameters behind the various algorithms used in the workflow.

Page 43: Cloud bioinformatics 2

Survey: Workshop Expectations III I expect to learn the basic bioinformatics tools. Learn different sequence alignment

software/technologies (i.e. BWA, Abyss, etc.). Learn more about the complexities of NGS sequencing

Next generation sequencing, data analysis etc. Parameters regulating assembly of contigs. How to take

raw data to an assembly, control the main parameters for assembly, mass analyze data for annotation and SNPs

How to compare expression profiles using RNA transcriptomes.

Want to learn new things

Page 44: Cloud bioinformatics 2

Survey: Operating System Being Used

Microsoft Windows on Intel/AMD – 14 (86.7%) Most running Windows 7 (some XP & Vista) One uses Linux through Westgrid and the IRMACS

clusterSome of you also thinking of running Linux

Apple OS X – 2 (13.3%) Snow Leopard Release Apple Lion, running Windows 7 using Parallels

Linux on Intel - 2 (13.3%)

Page 45: Cloud bioinformatics 2

Looking Ahead…What will you need for this workshop?

Mainly, just a laptop running a web browser(Optional) access to Linux/Unix locally (VM Player)

Reading list:Will give review citations for future lecturesFor next week, suggest that you surf to

http://www.ncbi.nlm.nih.gov/

Page 46: Cloud bioinformatics 2

Thank You