
Beijing, 31 October – 3 November 2010

Bioinformatics and computational tools

Etienne P. de Villiers (PhD)

International Livestock Research Institute, Nairobi, Kenya


International Livestock Research Institute (ILRI), Nairobi, Kenya

• ILRI works at the crossroads of livestock and poverty, bringing high‐quality science and capacity building to bear on poverty reduction and sustainable development.

• It is one of 15 centers supported by the Consultative Group on International Agricultural Research (CGIAR) that conduct food and environmental research to help alleviate poverty and increase food security.

• ILRI biotech facilities:
– Molecular biology laboratories (>6,000 sqm)
– State‐of‐the‐art biosciences equipment
  • 2 ABI sequencers (3130, 3730)
  • 1 Roche 454 GS FLX
– Bioinformatics unit
  • 64 CPU high performance compute cluster
– BSL3 laboratory
– Flow cytometry and microscopy
– Diagnostics (nucleotide‐ and protein‐based)
– Vaccine technology/Immunology
– Small and Large Animal units

http://www.ilri.org http://hub.africabiosciences.org/


Central dogma of molecular biology


Bioinformatics

• Bioinformatics is the application of information technology and computer science to the field of molecular biology.


Bioinformatics

Genome (DNA) sequence: ACGGTGCGTAACGTCAGTCAGGTCAGTCAGG

→ Bioinformatics or computational biology

→ Gene or protein properties, comparative analysis, protein structure and function prediction
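As a minimal sketch of the step from a raw DNA sequence to simple gene or protein properties, the Python below computes GC content, the reverse complement and a protein translation for the example sequence on the slide. The function names are illustrative. Using the standard genetic code and the first reading frame is an assumption, since the slide does not say how the sequence would be analysed.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order (assumed here).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str) -> str:
    """Translate reading frame 1; '*' marks a stop codon, trailing bases are ignored."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


if __name__ == "__main__":
    dna = "ACGGTGCGTAACGTCAGTCAGGTCAGTCAGG"  # sequence shown on the slide
    print(f"length: {len(dna)} bp, GC content: {gc_content(dna):.2f}")
    print("reverse complement:", reverse_complement(dna))
    print("frame 1 translation:", translate(dna))
```

Real pipelines would normally rely on an established library and on gene prediction rather than a fixed reading frame; the sketch only illustrates the kind of property that can be derived directly from sequence.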


The Sequencing Revolution: Next Generation Sequencing

• High‐throughput sequencing, 2000: 96 sequences per hour

• High‐throughput sequencing, 2010: 2.6 million sequences per hour


The Sequencing Revolution: Third Generation Sequencing

• Oxford Nanopore

• Pacific Biosciences

Single molecule sequencing

• ~3,000 wells per chip

• 1,500 bp per well

• 10 bp per second

• $1,000 human genome


Sequencing the Human Genome

(Figure: log10 of price plotted against year, 2000 to 2010.)

• 2001: Human Genome Project, $3 billion, 11 years

• 2001: Celera, $100 million, 3 years

• 2007: 454, $1M, 3 months

• 2008: ABI SOLiD, $60K, 2 weeks

• 2009: Illumina, $40–50K

• 2010: $5K, a few days
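The price points above span several orders of magnitude, which is why the chart uses a logarithmic axis. The short Python sketch below recomputes the log10 values and the overall fold reduction from the figures quoted on the slide. The midpoint used for the Illumina range and the placeholder platform label for 2010 are assumptions made for illustration.

```python
import math

# (year, project or platform, approximate cost in US dollars), as quoted on the slide
COSTS = [
    (2001, "Human Genome Project", 3_000_000_000),
    (2001, "Celera", 100_000_000),
    (2007, "454", 1_000_000),
    (2008, "ABI SOLiD", 60_000),
    (2009, "Illumina", 45_000),           # assumed midpoint of the quoted $40-50K range
    (2010, "(platform not named)", 5_000),
]


def log10_cost(cost: float) -> float:
    """Value that would be plotted on the chart's logarithmic price axis."""
    return math.log10(cost)


def fold_reduction(first_cost: float, last_cost: float) -> float:
    """How many times cheaper the last price is than the first."""
    return first_cost / last_cost


if __name__ == "__main__":
    for year, name, cost in COSTS:
        print(f"{year}  {name:<22} ${cost:>13,}  log10 = {log10_cost(cost):.1f}")
    overall = fold_reduction(COSTS[0][2], COSTS[-1][2])
    print(f"Overall reduction, 2001 to 2010: {overall:,.0f}-fold")
```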


Next Generation Sequencing: Current Projects

• 1000 Genomes project (www.1000genomes.org)

– Sequence the genomes of 2,500 people from diverse backgrounds to 4x coverage to identify human genetic variation.

• Ensembl genomes (www.ensemblgenomes.org)

– 234 species sequenced, from mammals and birds to parasites.

– >400 bacterial species sequenced.

• Plant genomes – 18 sequenced (www.phytozome.org/)

• BGI (China) (www.genomics.cn) ‐ 1,000 plant and animal reference genome project.
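To give a sense of the data volume behind the 1000 Genomes figures listed above, the sketch below multiplies the number of genomes by the coverage and the genome size. The human genome size of roughly 3.1 billion bases is an assumption that does not appear on the slide, and the function name is illustrative.

```python
HUMAN_GENOME_BP = 3_100_000_000  # assumed approximate haploid human genome size


def bases_required(n_genomes: int, coverage: int, genome_size_bp: int) -> int:
    """Total number of bases to sequence for a given cohort size and coverage depth."""
    return n_genomes * coverage * genome_size_bp


if __name__ == "__main__":
    total = bases_required(2500, 4, HUMAN_GENOME_BP)
    print(f"{total / 1e12:.0f} terabases of raw sequence")
```

Under these assumptions the project amounts to about 31 terabases of raw sequence, before any quality overhead or re-sequencing.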


Cost of Computing

• 1988: Cray YMP, $40,000,000

• 1998: Sun HPC1000, $1,000,000

• 2010: Intel Core i7 desktop, $1,000

(Figure: performance of each system in GigaFLOPS; values shown on the chart: 1, 3, 10 and 140.)


World Internet Connections


Cloud Computing

• Cloud computing is a general term for computation‐as‐a‐service.

• Computation‐as‐a‐service means that customers rent the hardware and storage only for the time needed to achieve their goals.

• Amazon Elastic Compute Cloud (Amazon EC2) provides resizable compute capacity in the cloud, including:
– High performance computing (HPC) on demand
  • 23 GB of memory
  • 64 compute nodes
  • 1.7 terabytes of storage
  • $1.60 per hour, or $5,000 per year
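The two prices quoted above imply a break‐even point between renting by the hour and paying by the year. The sketch below works it out, assuming the yearly price is a flat fee covering continuous use; the slide does not state the exact terms, and the names are illustrative.

```python
HOURLY_RATE = 1.60          # US dollars per hour, from the slide
YEARLY_RATE = 5_000.00      # US dollars per year, from the slide (assumed to be a flat fee)
HOURS_PER_YEAR = 24 * 365


def on_demand_cost(hours: float) -> float:
    """Cost of renting by the hour for the given number of hours."""
    return hours * HOURLY_RATE


def break_even_hours() -> float:
    """Hours of use per year above which the yearly price is cheaper."""
    return YEARLY_RATE / HOURLY_RATE


if __name__ == "__main__":
    hours = break_even_hours()
    print(f"Break-even: {hours:.0f} hours (about {hours / 24:.0f} days of continuous use)")
    print(f"A full year at the hourly rate: ${on_demand_cost(HOURS_PER_YEAR):,.0f}")
```

Under these assumptions the hourly rate is cheaper for jobs totalling less than about 3,125 hours a year, which suits occasional, bursty analyses.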


Distributed computing

• Distributed computing is any computing that involves multiple computers remote from each other.

• A central server sends and receives the work units (essentially just protein structures and sequences).

• The client uses the spare CPU cycles on a user’s computer (lab/office… home… anywhere) to run the simulation algorithm on the assigned structure.

• Results are automatically returned and exchanged for a new work unit on a daily basis.
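The exchange of work units between a central server and its clients can be sketched in a few lines of Python. This is a toy model under stated assumptions: threads stand in for remote computers, an in‐memory queue stands in for the network, and `simulate` is a hypothetical placeholder for the real simulation algorithm.

```python
import queue
import threading


def simulate(work_unit: str) -> int:
    """Placeholder for the real simulation: here it only counts residues."""
    return len(work_unit)


def client(work_units: queue.Queue, results: queue.Queue) -> None:
    """Fetch a work unit, process it, return the result, and repeat until none remain."""
    while True:
        try:
            unit = work_units.get_nowait()
        except queue.Empty:
            return  # no work left for this client
        results.put((unit, simulate(unit)))


def run_server(sequences: list[str], n_clients: int = 3) -> dict[str, int]:
    """Hand out one work unit per sequence and collect every result."""
    work_units: queue.Queue = queue.Queue()
    results: queue.Queue = queue.Queue()
    for seq in sequences:
        work_units.put(seq)

    # Each thread plays the role of one volunteer computer.
    threads = [
        threading.Thread(target=client, args=(work_units, results))
        for _ in range(n_clients)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    collected: dict[str, int] = {}
    while not results.empty():
        unit, value = results.get()
        collected[unit] = value
    return collected


if __name__ == "__main__":
    print(run_server(["MKV", "ACDEFG", "W"]))
```

A real system would add networking, authentication, and re‐issuing of work units that are never returned; none of that is modelled here.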


Distributed computing: Folding@home

• Aims to understand how existing proteins attain their specific, functional three‐dimensional structures.

• Uses distributed computing through a “screensaver” installed on users' computers.

• In 2009 it was running on 40,000 CPUs, or 5 PFLOPS.

• The fastest standalone supercomputer is “Tianhe‐1A” at 2.5 PFLOPS.


Metagenomics

• Metagenomics is the sequencing and analysis, using next generation sequencing technologies, of DNA from organisms recovered from an environment, without the need to culture them.

– The Sargasso Sea community survey
– Acid mine drainage film
– Human gut communities
– Symbiotic community from marine worm
– AVID project

Organisms


From Sequence (genomics/metagenomics) to impact

• (Meta)genome sequencing

• Compilation of complete genomes and metagenomes; annotation and curation of metadata

• Databases

• Extraction of important biological information:
– Sequence variation analysis
– Phylogenetic analysis
– Protein modeling
– Primer, microarray
– Geographical mapping

• Impact:
– Vaccine development
– Diagnostics
– Drug development
– Improved drug selection
– Global disease surveillance
– Better control tools
– Discovery of new micro‐organisms and pathways
– Environmental sustainability


AVID: Arbovirus Incidence & Diversity project

• Funded by the Google.org Predict and Prevent initiative.

• Pilot project on Rift Valley Fever virus.
– The virus is transmitted by mosquitoes and infects both animals and humans.
– It is deadly to both humans and livestock.
– Outbreaks occur every 5‐6 years.

A complex mix of species, subspecies and populations.

Can we understand its dynamics?


AVID ‐ Questions

• “Where” is the virus (between “outbreaks”)?
– Environment
– Vectors
– Reservoirs

• What is the diversity of:
– Virus
– Vector
– Reservoir

• And how do these interact?

• Distribution of other pathogens?

• Novel pathogens and variants?

For example: Does a particular virus variant occur in a particular vector variant associated with a particular mammalian variant?

Viral gene flow


AVID ‐ Strategy

• Samples are collected in specific areas:
– Human blood, livestock, wildlife, mosquitoes, water, soil

• Each sample is collected with a full metadata description (location, date/time, eco‐geo‐socio descriptors).

• Amplify sequences from multiple points on multiple possible genomes – virus, insect, mammal, others.

• Sequence these amplicons simultaneously from 1,000s of samples using next generation sequencing.

• Analyse sequences – look for distribution and co‐occurrence.

• Refine primers for a ‘simple’ (RT)‐PCR approach.

• Move diagnostic sequences on to high throughput PCR diagnostics.
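The “distribution and co‐occurrence” step in the strategy above can be illustrated with a small Python sketch that counts how often each virus variant is found in the same sample as each vector variant. The `Sample` record, its field names and the variant labels are all hypothetical; the project's actual data model is not described on the slides.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Sample:
    """One collected sample with minimal metadata (hypothetical fields)."""
    sample_id: str
    location: str
    collected: str                 # ISO date, e.g. "2010-01-15"
    virus_variants: frozenset
    vector_variants: frozenset


def co_occurrence(samples: list[Sample]) -> Counter:
    """Count the samples in which each (virus variant, vector variant) pair appears together."""
    counts: Counter = Counter()
    for sample in samples:
        pairs = product(sorted(sample.virus_variants), sorted(sample.vector_variants))
        counts.update(pairs)
    return counts


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    samples = [
        Sample("S1", "site-A", "2010-01-15", frozenset({"V1"}), frozenset({"M1", "M2"})),
        Sample("S2", "site-B", "2010-02-03", frozenset({"V1", "V2"}), frozenset({"M1"})),
    ]
    for (virus, vector), n in co_occurrence(samples).most_common():
        print(f"{virus} with {vector}: {n} sample(s)")
```

The same counting could be extended to a third axis, such as the mammalian host variant, and grouped by location or date to look at distribution.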



AVID – Data management and BioBANK

• Data management is one of the biggest challenges.
– The project cannot achieve its goals without great data integration.

• All samples are biobanked with full data descriptors.

• Opportunity to share samples across projects?

• Wildlife samples are very expensive, and everyone is collecting them for their own purposes!


Thank You
