PRBB Technical Seminars
Introduction to Python for bioinformatics
Giovanni Marco Dall'Olio, Unidad de Biologia Evolutiva – CEXS, Barcelona (Spain)

A talk for the PRBB Technical Seminars series (http://bg.imim.es/technical-seminars/) on python and bioinformatics

Page 1: Introduction to python for bioinformatics

PRBB Technical Seminars

Introduction to Python for bioinformatics

Giovanni Marco Dall'Olio
Unidad de Biologia Evolutiva – CEXS

Barcelona (Spain)

Page 2: Introduction to python for bioinformatics

Python

A programming language released in 1991 by Guido van Rossum

Used for a variety of applications, from scripting to web programming

Adopted by Google, Yahoo, YouTube, CERN, NASA, Red Hat...

Lots of jokes in the documentation (the language is named after Monty Python)

Page 3: Introduction to python for bioinformatics

Python and bioinformatics

Python is widely used in bioinformatics

(Chart: August 2007 survey; sources: www.bioinformaticszen.com, www.bioinformatics.org)

Page 4: Introduction to python for bioinformatics

Python – overall view

Learning curve: ☺☺☺☺☺ Easy to learn, yet powerful
Readability of a Python program: ☺☺☺☺☺
Community, availability of open source modules: ☺☺☺☺ (for bioinformatics, CPAN is slightly bigger)
Programming paradigms: ☺☺☺☺☺ Multi-paradigm (object oriented, structured, functional, etc.)
Execution speed: Interpreted language; programmer effort matters more than computer effort

Notes:

This talk is full of tables like this

They only reflect my opinion (a biologist with 3-4 years of experience)

Page 5: Introduction to python for bioinformatics

Python – Cons

State of open source libraries for bioinformatics: There is good support, but less than for Perl and R
Execution speed: Comparable to Perl, Java, Ruby, ...
SOAP libraries: SOAPpy is very old; suds is the best one
Population genetics modules: As with many other specialised modules, Perl and R are better supported
Lack of true multithreading support: A structural limit (the Global Interpreter Lock, GIL) makes it impossible to have real multithreading in Python (various workarounds exist)

Legend: [icon] = very sad, [icon] = fine

Page 6: Introduction to python for bioinformatics

Python – what makes me happy

General syntax: ☺☺☺☺☺
People are forced to write programs similar to yours: ☺☺☺☺☺
Quicker to write programs: ☺☺☺☺☺
Object oriented, multi-paradigm: ☺☺☺☺☺ (will be explained later)
Testing support: ☺☺☺☺ (will be explained later)

Page 7: Introduction to python for bioinformatics

Python – learning curve

Python's syntax is easy. You can concentrate on algorithms and problems instead of the programming language.

Page 8: Introduction to python for bioinformatics

Python – learning curve

Python's syntax is easy, so you can concentrate on algorithms and problems instead of the programming language.

With Python you don't have to worry about:

Learning strange symbols (~=, <>, eq, '\n', {}...)
Alternative syntaxes to do the same task
Declaring variables
Inner structure of strings/arrays
Low-level IO, passing variables by reference/value, etc.

Page 9: Introduction to python for bioinformatics

Example of python code

#!/usr/bin/env python
'''Some python examples'''

# example 1: a 'for' loop
for name in ('Albert', 'Aristoteles', 'Archimedes'):
    print 'hello, ', name

# example 2: Opening a file and parsing it
filehandler = open('samplefile.txt', 'r')
for line in filehandler.readlines():
    if line.startswith('>'):
        print line
    else:
        pass

Page 10: Introduction to python for bioinformatics

Python syntax I - indentation

In Python, indentation (= the spaces at the beginning of a line) is part of the syntax.

It is used to delimit loops and conditions, instead of curly braces ({})

Example:

The first 'print' is inside the loop, while the second is outside

for name in ('Albert', 'Aristoteles', 'Dayhoff'):
    print 'hello, ', name

print 'and hello to you, too'

Page 11: Introduction to python for bioinformatics

A quick perl/python comparison

(Python)

#!/usr/bin/env python

a = 3

if a == 3:
    print 'a is eq to 3'

(Perl)

#!/usr/bin/env perl

my $a = 3;

if ($a == 3) {
    print "a is eq to 3\n";
}

Python code is usually easier to read and contains fewer symbols (like {})

Page 12: Introduction to python for bioinformatics

Python syntax II - simplicity

Python keeps the number of syntax keywords to a minimum.

There is:
only one way to open files (no 'fopen', 'openf', etc.)
only one way to print (no printn, printf, sprintf, sprint, etc.)
only two ways to define loops ('for' and 'while').

Python's philosophy is about simplicity.

Your colleagues are forced to write their programs in the same way as you.
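As a small illustration of these points, the sketch below uses the single open() call for both writing and reading, print for output, and the two loop forms. It is written for Python 3 (the slides use Python 2 print statements), and the file name and its contents are made up for the example.

```python
import os
import tempfile

# Hypothetical sample data, written to a temporary file for the demo.
path = os.path.join(tempfile.mkdtemp(), 'samplefile.txt')
with open(path, 'w') as handle:        # the same open() is used for writing
    handle.write('>seq1\nACGT\n>seq2\nGGCC\n')

# Loop form 1: 'for' iterates over the lines of the file.
headers = []
with open(path, 'r') as handle:
    for line in handle:
        if line.startswith('>'):
            headers.append(line.strip())

# Loop form 2: 'while' repeats as long as a condition holds.
countdown = []
n = 3
while n > 0:
    countdown.append(n)
    n -= 1

print(headers)
print(countdown)
```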

Page 13: Introduction to python for bioinformatics

Python syntax III – declaring var

You don't need to declare variables

The type of a variable is defined the first time you assign a value to it

a = 'cacagtcaga' → a is a string

b = 133 → b is an integer

c = True → c is a boolean
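The same three assignments can be checked with the built-in type() function. This is a minimal sketch in Python 3 syntax (the slides use Python 2); the final reassignment is an extra illustration, not part of the original slide.

```python
a = 'cacagtcaga'
b = 133
c = True

# type() reports the type that each name currently refers to.
type_names = [type(value).__name__ for value in (a, b, c)]
print(type_names)

# No declaration is involved: the same name can later be bound to another type.
b = 'one hundred thirty-three'
print(type(b).__name__)
```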

Page 14: Introduction to python for bioinformatics

Notes on Python's speed

Python is an interpreted language
its speed is at the level of Perl, Java, etc.
programs are slower than C, but they are faster to write
importance of programmer effort over computer effort

Many ways to speed up Python
modules can also be written in C
some compilers exist (PyPy)
Google is working on an enhanced version of Python (news of March 2009).
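One way to see the effect of C-implemented code, assuming the standard CPython interpreter: the built-in sum() is implemented in C, while python_loop_sum (a hypothetical helper written for this sketch) does the same work in a Python-level loop. The example uses Python 3 syntax; timings depend on the machine, so none are quoted here.

```python
import timeit

numbers = list(range(100000))

def python_loop_sum(values):
    '''Add up the values with an explicit Python loop.'''
    total = 0
    for value in values:
        total += value
    return total

# Time 20 repetitions of each approach on the same data.
loop_time = timeit.timeit(lambda: python_loop_sum(numbers), number=20)
builtin_time = timeit.timeit(lambda: sum(numbers), number=20)

print('pure Python loop: %.4f s' % loop_time)
print('built-in sum():   %.4f s' % builtin_time)
```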

Page 15: Introduction to python for bioinformatics

Python – programming goodies

Installation and portability: ☺☺☺☺☺ Installed by default in most Linux distributions, interpreted
IDLE / text editors: ☺☺☺☺☺ Interactive shell, ipython, many editors
Installing and searching for new modules: ☺☺☺ easy_install, PyPI
Testing support: ☺☺☺☺☺ doctest, unittest, nose
Writing documentation: ☺☺☺☺☺
Debugging: logging, pdb

Page 16: Introduction to python for bioinformatics

Python – installation and portability

Python comes installed by default in most GNU/Linux distributions
Mac users have an old version (2.5), but can upgrade it
On Windows, you need to download an installer from www.python.org first

Being an interpreted language, Python programs are easy to port to other platforms

Page 17: Introduction to python for bioinformatics

PyPI (Python Package Index)

PyPI (repository of public python modules) pypi.python.org

PyPI is a repository of open source modules for python

For bioinformatics, it is smaller than CPAN, CRAN/Bioconductor, etc.

Page 18: Introduction to python for bioinformatics

Python – installing new modules

Modules can be automatically downloaded and installed using a tool called 'easy_install'

Examples:
easy_install -U biopython                        # install or update biopython from PyPI
easy_install --prefix ~/usr biopython            # install biopython without requiring admin privileges
easy_install biopython.tar.gz                    # install biopython from a previously downloaded tarball
easy_install http://www.biopython.org/install    # install biopython from its web site

Page 19: Introduction to python for bioinformatics

Using python

Python can be used as an interactive shell (like R, octave, matlab, etc..) or by writing programs

(python interactive shell)

gioby@dayhoff:~$ python
>>> print 'hola'
hola
>>> range(5)
[0, 1, 2, 3, 4]

(a python program)

gioby@dayhoff:~$ cat > prog.py
print 'hola'
print range(5)
gioby@dayhoff:~$ python prog.py
hola
[0, 1, 2, 3, 4]

Page 20: Introduction to python for bioinformatics

Python interactive shell

You can use it to run code without having to save it to a script.

It has no equivalent of the 'sessions' found in R

Many programmers prefer to use 'ipython', an enhanced version of this shell

gioby@dayhoff:~$ python
Python 2.5.2
Type "help", "copyright", "credits" or "license" for more information.
>>> print 'hola'
hola
>>> range(5)
[0, 1, 2, 3, 4]

Page 21: Introduction to python for bioinformatics

IPython session

gioby@dayhoff:~$ ipython
Type "copyright", "credits" or "license" for more information.

In [1]: import random

In [2]: random.choice(['ciao', 'hola', 'hello'])
Out[2]: 'hello'

In [3]: 1200 / 2
Out[3]: 600

In [4]: random?
(shows documentation on the random module)

In [5]: random.<TAB>
(shows auto-completion)

In [6]: !ls
(executes a bash command)

Page 22: Introduction to python for bioinformatics

Programming paradigms and testing

Programming paradigms: ☺☺☺☺☺ Multi-paradigm (object oriented, structured, functional, etc.)
Testing support: ☺☺☺☺☺ doctest, unittest, nose

Page 23: Introduction to python for bioinformatics

Python is a multiparadigm language

Your Python programs can be a simple list of instructions (imperative approach),
or you can write functions (functional),
or you can use objects (object oriented).

It's a multi-paradigm language
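To make the three styles concrete, here is a hedged sketch that computes the same quantity (the GC fraction of a sequence) as a plain list of instructions, as a function, and as a method of an object. It uses Python 3 syntax (the slides use Python 2); gc_content and the Sequence class are names invented for this illustration.

```python
sequence = 'ACGGCTAGGTCGATGCGATCG'

# 1. Imperative: a plain list of instructions.
gc = 0
for base in sequence:
    if base in 'GC':
        gc += 1
imperative_result = gc / len(sequence)

# 2. Function-based: the same logic wrapped in a reusable function.
def gc_content(seq):
    '''Return the fraction of G and C bases in seq.'''
    return sum(1 for base in seq if base in 'GC') / len(seq)

# 3. Object oriented: data and behaviour live together in a class.
class Sequence:
    def __init__(self, bases):
        self.bases = bases

    def gc_content(self):
        return gc_content(self.bases)

function_result = gc_content(sequence)
object_result = Sequence(sequence).gc_content()
print(imperative_result, function_result, object_result)
```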

Page 24: Introduction to python for bioinformatics

Python as an imperative language

print 'Hi, I am the psychotherapist'
print 'How do you do? What brings you here?'

response = raw_input()
print 'can you elaborate on that?'

response = raw_input()
print 'Why do you say it is ', response, '?'

....

Page 25: Introduction to python for bioinformatics

Python as a functional language

def get_sequence(filehandler):
    '''extracts the sequence from a fasta file'''
    sequence = ''
    for line in filehandler.readlines():
        if line.startswith('>'):
            pass    # header line: not part of the sequence
        else:
            sequence += line.strip()
    return sequence

def main():
    '''execute the main functions'''
    filepath = 'samplefile.txt'
    filehandler = open(filepath, 'r')
    sequence = get_sequence(filehandler)

.....

Page 26: Introduction to python for bioinformatics

Object Oriented Programming explained in two sentences

When you start having complicated nested variables (like arrays of hashes of arrays of lists of .....)

→ Object Oriented programming is something you should look at

Page 27: Introduction to python for bioinformatics

Object Oriented Programming example

genes = {
    'gene1': {
        'position': 10000,
        'chromosome': 11,
        'sequence': 'GTAGCCTGATGAACGGGCTAGCATGC....',
        'transcripts': {
            'transcript1': [......],
            'transcript2': [......],
        },
    },
    'gene2': {
        'position': ...........
    },
    .....
}

def get_subseq(genes, geneid, start, end):
    ''' get a subsequence of a gene, given a dictionary of gene
    annotations, a gene id, and start/end position '''
    pass


Page 29: Introduction to python for bioinformatics

A python class

class gene:

    def __init__(self):
        self.position = None
        self.sequence = ''
        self.transcripts = []

    def get_subseq(self, start, end):
        pass

Python's syntax for classes is easy

More concise than Java, and not mandatory to use classes

OO is very complicated in Perl
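A filled-in version of such a class might look like the sketch below. It is only an illustration in Python 3 syntax: the constructor arguments and the 0-based, end-exclusive coordinates of get_subseq are assumptions made here, not something the slides specify.

```python
class Gene:
    '''A gene with a genomic position and a sequence (illustrative only).'''

    def __init__(self, position, chromosome, sequence):
        self.position = position
        self.chromosome = chromosome
        self.sequence = sequence
        self.transcripts = []

    def get_subseq(self, start, end):
        '''Return the bases from start (inclusive) to end (exclusive), 0-based.'''
        return self.sequence[start:end]

# The data that lived in a nested dictionary now travels with its behaviour.
gene1 = Gene(position=10000, chromosome=11,
             sequence='GTAGCCTGATGAACGGGCTAGCATGC')
subseq = gene1.get_subseq(0, 5)
print(subseq)  # GTAGC
```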

Page 30: Introduction to python for bioinformatics

Python and Java classes

(A Java Class)

public class Gene {

    public int position;
    public String chromosome;
    public String[] transcripts;

    public Gene(int pos) {
        position = pos;
    }

    public void getSubseq(int start, int end) {
        // body omitted
    }
}

(A Python Class)

class gene:

    def __init__(self, pos):
        self.position = pos
        self.sequence = ''
        self.transcripts = []

    def get_subseq(self, start, end):
        pass

Page 31: Introduction to python for bioinformatics

Three ways to test a python program

When you write a program or a script and want to publish its results, you also need a way to prove that it works correctly

Python has good tools for testing:
Doctest
Unittest
Nosetest

Page 32: Introduction to python for bioinformatics

doctest

With doctest, you put examples of the usage of a function in its documentation

>>> help(say_hello)

Help on function say_hello in module __main__:

say_hello(name)
    print hello <name> to the screen

    example:
    >>> say_hello('Albert Einstein')
    hello Albert Einstein!!!

Doctest re-executes these examples, and if they don't return the expected values, a failure is reported
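A function carrying such a docstring could be written and checked as in the sketch below. The body of say_hello is a guess based on the help text above, and the code uses Python 3 syntax (the slides use Python 2). The examples are run explicitly through the doctest module's DocTestFinder and DocTestRunner classes so that the outcome is available as a value; in a script, doctest.testmod() is the more usual entry point.

```python
import doctest

def say_hello(name):
    '''Print hello <name> to the screen.

    example:
    >>> say_hello('Albert Einstein')
    hello Albert Einstein!!!
    '''
    print('hello %s!!!' % name)

# Collect the examples found in the docstring and re-execute them.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(say_hello, 'say_hello', globs={'say_hello': say_hello}):
    runner.run(test)

# summarize() returns the number of failed and attempted examples.
results = runner.summarize(verbose=False)
print(results)
```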

Page 33: Introduction to python for bioinformatics

Doctest example 2

Page 34: Introduction to python for bioinformatics

Doctest example 3

Doctests are useful when you collaborate with non-programmers

Page 35: Introduction to python for bioinformatics

unittest

import unittest

class SimpleFastaSeqCase(unittest.TestCase):

    @classmethod
    def setUpClass(cls):        # executed once, before all the tests
        .....

    @classmethod
    def tearDownClass(cls):     # executed once, after all the tests
        .....

    def setUp(self):            # executed before each one of the tests
        .....

    def tearDown(self):         # executed after each one of the tests
        .....

    def testCondition1(self):   # a test
        .....

    def testCondition2(self):   # a test
        .....
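For a version of this skeleton that actually runs, here is a minimal sketch in Python 3 syntax. The function under test, reverse_complement, and the test names are hypothetical and chosen only for illustration; setUp prepares a fresh fixture before each test, as described above.

```python
import unittest

def reverse_complement(seq):
    '''Hypothetical function under test: reverse-complement a DNA string.'''
    pairs = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
    return ''.join(pairs[base] for base in reversed(seq))

class ReverseComplementCase(unittest.TestCase):

    def setUp(self):
        # Executed before each test method.
        self.seq = 'ATGC'

    def test_known_sequence(self):
        self.assertEqual(reverse_complement(self.seq), 'GCAT')

    def test_twice_is_identity(self):
        twice = reverse_complement(reverse_complement(self.seq))
        self.assertEqual(twice, self.seq)

# Load and run the test case explicitly instead of calling unittest.main().
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ReverseComplementCase)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```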

Page 36: Introduction to python for bioinformatics

nosetest

Nosetest scans your code and looks for all the functions with 'test_' in their names

def getfasta(filename):
    pass

def count_numbers(numbers, limit):
    pass

def you_like_this_talk(subliminal = True):
    pass

def test_everything_ok():    # <- this is a test
    pass
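The idea of name-based discovery can be mimicked in a few lines. The sketch below is a toy stand-in written for this illustration in Python 3 syntax; it is not how nose is implemented, and count_gc, the two test functions and discover_tests are hypothetical names.

```python
def count_gc(seq):
    '''Count the G and C bases in a sequence.'''
    return sum(1 for base in seq if base in 'GC')

def test_count_gc():
    assert count_gc('ACGT') == 2

def test_count_gc_empty():
    assert count_gc('') == 0

def discover_tests(functions):
    '''Keep only the functions whose name starts with 'test_'.'''
    return [func for func in functions if func.__name__.startswith('test_')]

# Candidate functions; a real runner would collect these from modules on disk.
candidates = [count_gc, test_count_gc, test_count_gc_empty]
found = discover_tests(candidates)

for test in found:
    test()    # an AssertionError here would mean a failing test

found_names = [test.__name__ for test in found]
print(found_names)
```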

Page 37: Introduction to python for bioinformatics

Message

Python is easy to learn and write

It has good tools to test and demonstrate that your programs work correctly

Page 38: Introduction to python for bioinformatics

Python – some bioinfo use cases

Regular expressions, motif search: re, TAMO, biopython ☺☺☺☺ To use regular expressions, it is necessary to import a module. Getting help is easier
Convert a sequence file to another format: biopython ☺☺☺☺ Biopython is growing its support for bioinformatics formats
Working with genomic data: pygr ☺☺☺☺ Pygr is a great environment to work with genomic data
Query GenBank: Biopython, pygr ☺☺☺☺
Structural bioinformatics: I don't know

Page 39: Introduction to python for bioinformatics

Regular Expressions in Python

Using regular expressions in Python requires one more step than in Perl

You have to import a module called re first

Regular expressions are also less 'central' to the developers of the language

Page 40: Introduction to python for bioinformatics

Example – Regular Expressions in python

>>> import re

>>> sequence = 'ACGGCTAGGTCGATGCGATCG'

>>> re.findall('A.G', sequence)
['ACG', 'AGG', 'ATG']

>>> help(re)
<get help on regular expressions>

The only advantage of Python over Perl for regular expressions is that it is easier to get help
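When the position of each match matters, as it often does in motif searches, re.finditer returns match objects instead of plain strings. A short sketch on the same sequence, in Python 3 syntax (the slides use Python 2):

```python
import re

sequence = 'ACGGCTAGGTCGATGCGATCG'
motif = re.compile('A.G')    # an A, any single character, then a G

# Each match object knows where it starts and what text it matched.
hits = [(match.start(), match.group()) for match in motif.finditer(sequence)]

for start, text in hits:
    print(start, text)
```

Like re.findall, this reports non-overlapping matches only.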

Page 41: Introduction to python for bioinformatics

Biopython

A collection of free modules for bioinformatics

Number of functionalities implemented: bioconductor > bioperl > biopython > all others

Strong points:
File format support
NCBI – Entrez APIs
PDB / structures

Page 42: Introduction to python for bioinformatics

Biopython Examples

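# Parse a Fasta file into a dictionary of records (the first step before converting it to Genbank)
from Bio import SeqIO
seqfile = open('fastafile.fa', 'r')
sequences = SeqIO.to_dict(SeqIO.parse(seqfile, 'fasta'))

# Query NCBI
from Bio import Entrez
Entrez.email = 'you@example.org'    # placeholder address; NCBI asks for a contact email
results = Entrez.esearch(db='nucleotide', term='cox2')
Entrez.read(results)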

Page 43: Introduction to python for bioinformatics

Pygr

Great for genome-wide analysis

Makes it automatic to:
Store/retrieve data in databases or pickles
Use and configure local BLAST databases
Create annotations and store them
Interface with NCBI, Ensembl (equivalent to the Ensembl Perl APIs), UCSC

Page 44: Introduction to python for bioinformatics

Pygr examples

# Ensembl APIs
serverRegistry = get_registry(
    host='ensembldb.ensembl.org',
    user='anonymous')

coreDBAdaptor = serverRegistry.get_DBAdaptor(
    'homo_sapiens', 'core', '47_36i')
sequence = coreDBAdaptor.fetch_slice_by_seqregion(
    coordSystemName, seqregionName)

# Download the sequence of the Human Genome (18)
import pygr.Data
hg18 = pygr.Data.Bio.Seq.Genome.HUMAN.hg18(
    download=True)

Page 45: Introduction to python for bioinformatics

TAMO and pyHMM

>>> from TAMO import MotifTools
>>> msa = ['TGACTCA',
...        'TGACTCA',
...        'TGAGTCA',
...        'TGAGTCA']

>>> m_msa = MotifTools.Motif(msa)
>>> print m_msa
TGAsTCA(4)

>>> m_msa._print_counts()
#       0      1      2      3      4      5      6
#A  0.000  0.000  4.000  0.000  0.000  0.000  4.000
#C  0.000  0.000  0.000  2.000  0.000  4.000  0.000
#T  4.000  0.000  0.000  0.000  4.000  0.000  0.000
#G  0.000  4.000  0.000  2.000  0.000  0.000  0.000

Module to work with motifs
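The count table above can be reproduced with the standard library alone, which may help to see what it contains. This is a sketch of the underlying idea in Python 3 syntax, not TAMO's implementation: each column of the alignment is tallied with collections.Counter.

```python
from collections import Counter

msa = ['TGACTCA',
       'TGACTCA',
       'TGAGTCA',
       'TGAGTCA']

# zip(*msa) yields the alignment column by column.
columns = [Counter(column) for column in zip(*msa)]

# One row per base, one entry per position (a Counter returns 0 for absent bases).
counts = {base: [column[base] for column in columns] for base in 'ACTG'}

for base in 'ACTG':
    print(base, counts[base])
```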

Page 46: Introduction to python for bioinformatics

Python – bioinformatics utilities

Scientific and statistics: scipy + numpy ☺☺☺☺☺
Plotting graphs: Matplotlib (pylab) ☺☺☺☺☺
SOAP / web scraping utilities: suds ☺☺☺
ORM modules, database handling, HDF5: Sqlalchemy + elixir, sqlobject, pytables ☺☺☺☺☺
Persistent data: cPickle, shelve, ZODB ☺☺☺☺ No R-like sessions
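For the persistent-data row, a minimal sketch of the two standard-library options follows. It uses Python 3, where the Python 2 cPickle module has been folded into pickle; the gene dictionary and the file names are invented for the example.

```python
import os
import pickle
import shelve
import tempfile

genes = {'gene1': {'position': 10000, 'chromosome': 11}}
workdir = tempfile.mkdtemp()

# pickle: serialise a whole object to a file and load it back.
pickle_path = os.path.join(workdir, 'genes.pickle')
with open(pickle_path, 'wb') as handle:
    pickle.dump(genes, handle)
with open(pickle_path, 'rb') as handle:
    restored = pickle.load(handle)

# shelve: a persistent, dictionary-like store of pickled values.
shelf_path = os.path.join(workdir, 'genes.shelf')
with shelve.open(shelf_path) as shelf:
    shelf['gene1'] = genes['gene1']
with shelve.open(shelf_path) as shelf:
    chromosome = shelf['gene1']['chromosome']

print(restored, chromosome)
```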

Page 47: Introduction to python for bioinformatics

Python and Databases

There are some good libraries for Object Relational Mapping (ORM)

ZODB: object-oriented database
PyTables: hierarchical database (supports HDF5, a binary format used in astronomy/physics to store big data)
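To show what "mapping" rows to objects means, the sketch below does it by hand with the standard-library sqlite3 module. It is deliberately not SQLAlchemy: an ORM library generates this kind of glue code for you. The table layout, the Gene class and load_gene are assumptions made for the example, written in Python 3 syntax.

```python
import sqlite3

class Gene:
    '''Plain Python object that mirrors one row of the genes table.'''

    def __init__(self, name, chromosome, position):
        self.name = name
        self.chromosome = chromosome
        self.position = position

# An in-memory database, so the example leaves no files behind.
connection = sqlite3.connect(':memory:')
connection.execute(
    'CREATE TABLE genes (name TEXT PRIMARY KEY, chromosome INTEGER, position INTEGER)')
connection.execute('INSERT INTO genes VALUES (?, ?, ?)', ('gene1', 11, 10000))

def load_gene(conn, name):
    '''Fetch one row and turn it into a Gene object (None if not found).'''
    row = conn.execute(
        'SELECT name, chromosome, position FROM genes WHERE name = ?',
        (name,)).fetchone()
    return Gene(*row) if row else None

gene = load_gene(connection, 'gene1')
missing = load_gene(connection, 'gene999')
print(gene.name, gene.chromosome, gene.position)
connection.close()
```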

Page 48: Introduction to python for bioinformatics

sqlalchemy example

Page 49: Introduction to python for bioinformatics

Scientific Python

Numpy: Python module to work with arrays and matrices

Scipy: module to do advanced math, statistics, and more

Matplotlib: module to plot graphs

To get started with Python and plotting graphs:

$: easy_install numpy scipy matplotlib ipython
$: ipython -pylab

Page 50: Introduction to python for bioinformatics

Numpy/Scipy example

Hint: use ipython -pylab to have an R-like environment

Page 51: Introduction to python for bioinformatics

Is there anything I forgot?

?????

?????

?????

Page 52: Introduction to python for bioinformatics

Thank you for your attention!

PRBB technical seminars: http://bg.imim.es/technical-seminars/

These slides will be uploaded to http://www.slideshare.net

Page 53: Introduction to python for bioinformatics
Page 54: Introduction to python for bioinformatics

Discarded slides

Page 55: Introduction to python for bioinformatics

Hint: use ipython -pylab

The best way to work with python and plotting graphs is with ipython -pylab

It will give you a shell similar to matlab/octave/R/etc..

Page 56: Introduction to python for bioinformatics

Regular expressions

To use regular expressions in python, you need to import the 're' module first

It's not as immediate as in Perl, where you can use regular expressions without importing anything

However, it is easier to get the documentation

Page 57: Introduction to python for bioinformatics

Main python modules for bioinformatics

Biopython
Pygr

Page 58: Introduction to python for bioinformatics

Python – storing/accessing data

Reading/writing files: ☺☺☺☺☺
Persistent data: ☺☺☺ cPickle, shelve, ZODB
Database – Object Relational Mapping libraries: ☺☺☺☺☺ sqlalchemy, elixir
Binary formats (HDF5): ☺☺☺☺ pytables
R-like sessions: Nothing to my knowledge :(