Introduction to python for bioinformatics


A talk for the PRBB Technical Seminars series (http://bg.imim.es/technical-seminars/) on python and bioinformatics


prbb technical seminars

Introduction to Python for bioinformatics

Giovanni Marco Dall'Olio, Unidad de Biologia Evolutiva – CEXS

Barcelona (Spain)

Python

A programming language released in 1991 by Guido van Rossum

Used for a variety of applications, from scripting to web programming

Adopted by Google, Yahoo, YouTube, CERN, NASA, Red Hat....

Lots of jokes in the documentation (it is named after Monty Python)

Python and bioinformatics

Python is widely used in bioinformatics

Source: August 2007 survey (www.bioinformaticszen.com, www.bioinformatics.org)

Python – overall view

Learning curve: ☺☺☺☺☺ Easy to learn, yet powerful

Readability of a Python program: ☺☺☺☺☺

Community, availability of open source modules: ☺☺☺☺ (for bioinformatics, CPAN is slightly bigger)

Programming paradigms: ☺☺☺☺☺ Multi-paradigm (object oriented, structured, functional, etc.)

Execution speed: interpreted language; programmer effort matters more than computer effort

Notes:

This talk is full of tables like this

They only reflect my opinion (biologist with 3-4 years experience)

Python – Cons

State of open source libraries for bioinformatics: there is good support, but less than for Perl and R

Execution speed: comparable to Perl, Java, Ruby, ...

SOAP libraries: SOAPpy is very old; suds is the best one

Population genetics modules: as with many other specialized modules, Perl and R are better supported

Lack of true multithreading support: a structural limit (the Global Interpreter Lock of the standard interpreter) prevents threads from running Python code in parallel (various workarounds exist)


Python – what makes me happy

General syntax ☺☺☺☺☺ People are forced to write programs similar to yours

☺☺☺☺☺

Quicker to write programs

☺☺☺☺☺

Object Oriented, multi-paradigm

☺☺☺☺☺ (will be explained later)

Testing support ☺☺☺☺ (will be explained later)


Python – learning curve

Python's syntax is easy, so you can concentrate on algorithms and problems instead of the programming language

With Python you don't have to worry about:

Learning strange symbols (~=, <>, eq, '\n', {}...)
Alternative syntaxes to do the same task
Declaring variables
Inner structure of strings/arrays
Low level IO, passing variables per reference/value, etc..

Example of python code

    #!/usr/bin/env python
    '''Some python examples'''

    # example 1: a 'for' loop
    for name in ('Albert', 'Aristoteles', 'Archimedes'):
        print 'hello, ', name

    # example 2: Opening a file and parsing it
    filehandler = open('samplefile.txt', 'r')
    for line in filehandler.readlines():
        if line.startswith('>'):
            print line
        else:
            pass

Python syntax I - indentation

In Python, the indentation (= spaces at the beginning of the line) is part of the syntax.

It is used to delimit loops and conditions, instead of curly braces ({})

Example:

The first 'print' is inside the loop, while the second is outside

    for name in ('Albert', 'Aristoteles', 'Dayhoff'):
        print 'hello, ', name

    print 'and hello to you, too'
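
The same idea as a small runnable sketch. It is written for Python 3, where print is a function (the slides use the Python 2 print statement), and the list named greetings is only an illustrative helper so that the effect of the indentation can be inspected.

```python
# Python 3 sketch: indentation decides what belongs to the loop.
greetings = []
for name in ('Albert', 'Aristoteles', 'Dayhoff'):
    # indented: this line runs once for every name
    greetings.append('hello, ' + name)
# not indented: this line runs once, after the loop has finished
greetings.append('and hello to you, too')

for line in greetings:
    print(line)
```

Three greetings come from inside the loop and a single one from the line after it.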

A quick perl/python comparison

(Python)

    #!/usr/bin/env python

    a = 3

    if a == 3:
        print 'a is eq to 3'

(Perl)

    #!/usr/bin/env perl

    my $a = 3;

    if ($a == 3) {
        print "a is eq to 3\n";
    }

Python code is usually easier to read and contains fewer symbols (like {})

Python syntax II - simplicity

Python has a minimal number of syntax keywords.

There is:
only one way to open files (no 'fopen', 'openf', etc..)
only one way to print (no printn, printf, sprintf, sprint, etc..)
only two ways to define loops ('for' and 'while').

Python's philosophy is about simplicity.

Your colleagues are forced to write their programs in the same way as you.

Python syntax III – declaring var

You don't need to declare variables

The type of a variable is defined the first time you assign a value to it

a = 'cacagtcaga' → a is a string

b = 133 → b is an integer

c = True → c is a boolean
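
A minimal sketch of how this can be checked from the interpreter, using the built-in type() function (Python 3 syntax). The last two lines go one step beyond the slide: a name is not tied to its first type and can be rebound later.

```python
a = 'cacagtcaga'
b = 133
c = True

# type() reports the type that each value received on assignment
types = [type(a).__name__, type(b).__name__, type(c).__name__]
print(types)  # ['str', 'int', 'bool']

# assigning again rebinds the name; here 'a' becomes an integer
a = len(a)
print(type(a).__name__, a)  # int 10
```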

Notes on Python's speed

Python is an interpreted language
  its speed is at the level of perl, java, etc.
  programs are slower than C, but they are faster to write
  importance of programmer effort over computer effort

Many ways to speed up python
  modules can also be written in C
  some compilers exist (PyPy)
  Google is working on an enhanced version of python (news of March 2009).

Python – programming goodies

Installation and portability: ☺☺☺☺☺ Installed by default in most Linux distributions, interpreted

IDLE / text editors: ☺☺☺☺☺ Interactive shell, ipython, many editors

Install and search new modules: ☺☺☺ easy_install, PyPI

Testing support: ☺☺☺☺☺ doctest, unittest, nose

Writing documentation: ☺☺☺☺☺

Debugging: logging, pdb

Python – installation and portability

Python comes installed by default in most GNU/Linux distributions

Mac users have an old version (2.5), but can upgrade it

On Windows, you first need to download an installer from www.python.org

Being an interpreted language, Python programs are easy to port to other platforms

PyPI (Python Package Index)

PyPI (pypi.python.org) is a repository of public, open source modules for Python

For bioinformatics, it is smaller than CPAN, CRAN/Bioconductor, etc..

Python – installing new modules

Modules can be automatically downloaded and installed using a tool called 'easy_install'

Examples:

    # install or update biopython from PyPI
    easy_install -U biopython

    # install biopython without requiring admin privileges
    easy_install --prefix ~/usr biopython

    # install biopython from a previously downloaded tarball
    easy_install biopython.tar.gz

    # install biopython from its web site
    easy_install http://www.biopython.org/install

Using python

Python can be used as an interactive shell (like R, octave, matlab, etc..) or by writing programs

(python interactive shell)

    gioby@dayhoff:~$ python
    >>> print 'hola'
    hola
    >>> range(5)
    [0, 1, 2, 3, 4]

(a python program)

    gioby@dayhoff:~$ cat > prog.py
    print 'hola'
    print range(5)
    gioby@dayhoff:~$ python prog.py
    hola
    [0, 1, 2, 3, 4]

Python interactive shell

You can use it to run programs without having to save them to a script.

It has no equivalent of the 'session' in R

Many programmers prefer to use 'ipython', an enhanced version of this shell

    gioby@dayhoff:~$ python
    Python 2.5.2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print 'hola'
    hola
    >>> range(5)
    [0, 1, 2, 3, 4]

IPython session

    gioby@dayhoff:~$ ipython
    Type "copyright", "credits" or "license" for more information.

    In [1]: import random

    In [2]: random.choice(['ciao', 'hola', 'hello'])
    Out[2]: 'hello'

    In [3]: 1200 / 2
    Out[3]: 600

    In [4]: random?
    (shows documentation on the random module)

    In [5]: random.<TAB>
    (shows auto-completion)

    In [6]: !ls
    (executes a bash command)

Programming paradigms and testing

Programming paradigms

☺☺☺☺☺ Multi paradigm (Object Oriented, Structured, Functional, etc..)

Testing support ☺☺☺☺☺ doctest, unittest, nose

Python is a multiparadigm language

Your Python programs can be a simple list of instructions (imperative approach), or you can write functions (functional), or you can use objects (object oriented)

Python as an imperative language

    print 'Hi, I am the psychotherapist'
    print 'How do you do? What brings you here?'

    response = raw_input()
    print 'can you elaborate on that?'

    response = raw_input()
    print 'Why do you say it is ', response, '?'

    ....

Python as a functional language

    def get_sequence(filehandler):
        '''extracts the sequence from a fasta file'''
        sequence = ''
        for line in filehandler.readlines():
            if line.startswith('>'):
                pass    # header line, not part of the sequence
            else:
                sequence += line.strip()
        return sequence

    def main():
        '''execute the main functions'''
        filepath = 'samplefile.txt'
        filehandler = open(filepath, 'r')
        get_sequence(filehandler)

    .....
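
A self-contained sketch of the same function that can be run without a file on disk, written for Python 3. It assumes that "extracting the sequence" means skipping the header lines that start with '>' and joining the remaining lines. The FASTA record and the use of io.StringIO in place of open('samplefile.txt') are made up for the illustration.

```python
import io

def get_sequence(filehandler):
    """Return the sequence lines of a FASTA file joined into one string."""
    sequence = ''
    for line in filehandler:
        if line.startswith('>'):
            continue  # header line, not part of the sequence
        sequence += line.strip()
    return sequence

def main():
    # hypothetical record; io.StringIO behaves like an open text file
    fake_file = io.StringIO('>seq1 example\nACGT\nTTGA\n')
    return get_sequence(fake_file)

print(main())  # ACGTTTGA
```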

Object Oriented Programming explained in two sentences

When you start having complicated nested variables (like arrays of hashes of arrays of lists of .....)

→ Object Oriented programming is something you should look at

Object Oriented Programming example

    genes = {
        'gene1': {
            'position': 10000,
            'chromosome': 11,
            'sequence': 'GTAGCCTGATGAACGGGCTAGCATGC....',
            'transcripts': {
                'transcript1': [......],
                'transcript2': [......],
            },
        },
        'gene2': {
            'position': ...........
        },
        .....
    }

    def get_subseq(genes, geneid, start, end):
        '''get a subsequence of a gene, given a dictionary of gene
        annotations, a gene id, and start/end position'''
        pass
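
One possible way to fill in get_subseq for the dictionary layout above, as a sketch in Python 3. The slide leaves the function body empty, so the coordinate convention (0-based start, end excluded, as in Python slicing) is an assumption, and the dictionary is cut down to the visible part of the gene1 sequence.

```python
# reduced, illustrative version of the nested dictionary
genes = {
    'gene1': {
        'position': 10000,
        'chromosome': 11,
        'sequence': 'GTAGCCTGATGAACGGGCTAGCATGC',
    },
}

def get_subseq(genes, geneid, start, end):
    """Return the bases start..end-1 of the sequence stored for geneid."""
    return genes[geneid]['sequence'][start:end]

print(get_subseq(genes, 'gene1', 0, 5))  # GTAGC
```

Every function that works on this structure has to receive the whole dictionary and know its internal layout.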


A python class

    class gene:

        def __init__(self):
            self.position = None
            self.sequence = ''
            self.transcripts = []

        def get_subseq(self, start, end):
            pass

Python's syntax for classes is easy

More concise than Java, and not mandatory to use classes

OO is very complicated in Perl
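
A small runnable version of such a class, as a Python 3 sketch. The constructor arguments and the body of get_subseq are assumptions (the slide leaves the method empty); slicing is 0-based with the end excluded.

```python
class Gene:
    """Keeps the data of one gene together with the code that uses it."""

    def __init__(self, position, sequence):
        self.position = position
        self.sequence = sequence
        self.transcripts = []

    def get_subseq(self, start, end):
        """Return the bases start..end-1 of this gene's sequence."""
        return self.sequence[start:end]

# illustrative values taken from the dictionary example on the slides
gene = Gene(position=10000, sequence='GTAGCCTGATGAACGGGCTAGCATGC')
print(gene.get_subseq(0, 5))  # GTAGC
```

Compared with a nested dictionary, the caller no longer needs to know how the sequence is stored.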

Python and Java classes

(A Java Class)

    public class Gene {

        public int position;
        public String chromosome;
        public String[] transcripts;

        public Gene(int pos) {
            position = pos;
        }

        public void getSubseq(int start, int end) {
            // not implemented
        }
    }

(A Python Class)

    class gene:

        def __init__(self, pos):
            self.position = pos
            self.sequence = ''
            self.transcripts = []

        def get_subseq(self, start, end):
            pass

Three ways to test a python program

When you write a program or a script and want to publish its results, you also need a way to prove that it works correctly

Python has good instruments for testing: doctest, unittest, nose

doctest

With doctest, you put examples of the usage of a function in its documentation

    >>> help(say_hello)

    Help on function say_hello in module __main__:

    say_hello(name)
        print hello <name> to the screen

        example:
        >>> say_hello('Albert Einstein')
        hello Albert Einstein!!!

Doctest tries to re-execute these examples, and if they don't return the expected values, an error is reported
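
A minimal sketch of what the source file behind such a help page could look like, in Python 3. Unlike the function on the slide, this version returns the greeting instead of printing it, which is why the expected output in the docstring is shown with quotes. doctest.testmod() is the standard library call that re-runs every example found in the module's docstrings.

```python
import doctest

def say_hello(name):
    """Return a greeting for <name>.

    example:
    >>> say_hello('Albert Einstein')
    'hello Albert Einstein!!!'
    """
    return 'hello ' + name + '!!!'

if __name__ == '__main__':
    # silent when every example passes, prints a report otherwise
    doctest.testmod()
```

Running the file with the -v option (python thisfile.py -v) lists each example as it is checked.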

Doctest example 2

Doctest example 3

Doctests are useful when you collaborate with non-programmers

unittest

    import unittest

    class SimpleFastaSeqCase(unittest.TestCase):

        # Instructions to be executed before/after all the tests
        @classmethod
        def setUpClass(cls):
            .....

        @classmethod
        def tearDownClass(cls):
            .....

        # Instructions to be executed before/after each one of the tests
        def setUp(self):
            .....

        def tearDown(self):
            .....

        # Tests
        def testCondition1(self):
            .....

        def testCondition2(self):
            .....
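
A complete, runnable test case following the skeleton above, as a Python 3 sketch. The function under test (count_bases) and the helper run_tests are hypothetical names introduced only for this illustration; the unittest calls themselves are from the standard library.

```python
import unittest

def count_bases(sequence):
    """Hypothetical function under test: count each base in a sequence."""
    counts = {}
    for base in sequence:
        counts[base] = counts.get(base, 0) + 1
    return counts

class CountBasesCase(unittest.TestCase):

    def setUp(self):
        # executed before each one of the tests
        self.sequence = 'ACGGA'

    def test_counts(self):
        self.assertEqual(count_bases(self.sequence), {'A': 2, 'C': 1, 'G': 2})

    def test_empty(self):
        self.assertEqual(count_bases(''), {})

def run_tests():
    """Run the test case and return the unittest result object."""
    suite = unittest.TestLoader().loadTestsFromTestCase(CountBasesCase)
    return unittest.TextTestRunner(verbosity=0).run(suite)

result = run_tests()
```

In a stand-alone script the last line is usually replaced by unittest.main() under an if __name__ == '__main__': guard.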

nosetest

Nosetest - it scans your code and looks for all the functions with the word 'test_' in their names

    def getfasta(filename):
        pass

    def count_numbers(numbers, limit):
        pass

    def you_like_this_talk(subliminal = True):
        pass

    def test_everything_ok():    # This is a test
        pass
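
To make the naming convention concrete, here is a sketch in plain Python 3 that mimics the idea of collecting functions by name. It is not how nose is implemented and does not use the nose API; the helper collect_tests and the body given to count_numbers are assumptions made for the illustration.

```python
def count_numbers(numbers, limit):
    """Hypothetical implementation: how many numbers are below limit."""
    return len([n for n in numbers if n < limit])

def test_count_numbers():
    assert count_numbers([1, 5, 10], 6) == 2

def test_count_numbers_empty():
    assert count_numbers([], 6) == 0

def collect_tests(namespace):
    """Return the sorted names of callables whose name starts with 'test_'."""
    return sorted(name for name, obj in namespace.items()
                  if name.startswith('test_') and callable(obj))

# in a real module the namespace would be the module's own dictionary
namespace = {f.__name__: f for f in
             (count_numbers, test_count_numbers, test_count_numbers_empty)}
found = collect_tests(namespace)
for name in found:
    namespace[name]()  # a failing assert would raise AssertionError
print(found)  # ['test_count_numbers', 'test_count_numbers_empty']
```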

Message

Python is easy to learn and write

It has good tools to test and demonstrate that your programs work correctly

Python – some bioinfo use cases

Regular expressions, motif search: re, TAMO, biopython ☺☺☺☺ To use regular expressions, it is necessary to import a module. Getting help is easier

Convert a sequence file to another format: biopython ☺☺☺☺ Biopython is growing its support for bioinformatics formats

Working with genomic data: pygr ☺☺☺☺ Pygr is a great environment to work with genomic data

Query Genbank: Biopython, pygr ☺☺☺☺

Structural Bioinformatics: I don't know

Regular Expressions in Python

Using regular expressions in Python requires one more step than in Perl

You have to import a module called re first

Regular expressions are also less 'central' to the developers of the language

Example – Regular Expressions in python

    >>> import re
    >>> sequence = 'ACGGCTAGGTCGATGCGATCG'
    >>> re.findall('A.G', sequence)
    ['ACG', 'AGG', 'ATG']
    >>> help(re)
    <get help on regular expressions>

The only advantage of Python over Perl for regular expressions is that it is easier to get help
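
The same search as a script, extended with re.finditer to also report where each match starts (Python 3 sketch). The position list is an addition to the slide's example; both functions are part of the standard re module.

```python
import re

sequence = 'ACGGCTAGGTCGATGCGATCG'
pattern = re.compile('A.G')  # an A, any single character, then a G

matches = pattern.findall(sequence)
# finditer yields match objects, which also know their position
positions = [m.start() for m in pattern.finditer(sequence)]

print(matches)    # ['ACG', 'AGG', 'ATG']
print(positions)  # [0, 6, 12]
```

Matches are non-overlapping and positions are counted from 0.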

Biopython

A collection of free modules for bioinformatics

Number of functionalities implemented: bioconductor > bioperl > biopython > all others

Strong points:
File format support
NCBI – entrez APIs
Pdb / structures

Biopython Examples

    # Parse a Fasta file into a dictionary of sequences
    from Bio import SeqIO
    seqfile = open('fastafile.fa', 'r')
    sequences = SeqIO.to_dict(SeqIO.parse(seqfile, 'fasta'))

    # Query NCBI
    from Bio import Entrez
    results = Entrez.esearch(db='nucleotide', term='cox2')
    Entrez.read(results)

Pygr

Great for genome-wide analysis

Makes it automatic to:
Store/retrieve data in databases or pickles
Use and configure local blast databases
Create annotations and store them
Interface with ncbi, ensembl (eq. to ensembl perl APIs), ucsc

Pygr examples

    # Ensembl APIs
    serverRegistry = get_registry(
        host='ensembldb.ensembl.org',
        user='anonymous')
    coreDBAdaptor = serverRegistry.get_DBAdaptor(
        'homo_sapiens', 'core', '47_36i')
    sequence = coreDBAdaptor.fetch_slice_by_seqregion(
        coordSystemName, seqregionName)

    # Download the sequence of the Human Genome (18)
    import pygr.Data
    hg18 = pygr.Data.Bio.Seq.Genome.HUMAN.hg18(
        download=True)

TAMO and pyHMM

TAMO: module to work with motifs

    >>> from TAMO import MotifTools
    >>> msa = ['TGACTCA',
    ...        'TGACTCA',
    ...        'TGAGTCA',
    ...        'TGAGTCA']
    >>> m_msa = MotifTools.Motif(msa)
    >>> print m_msa
    TGAsTCA(4)
    >>> m_msa._print_counts()
    #       0      1      2      3      4      5      6
    #A  0.000  0.000  4.000  0.000  0.000  0.000  4.000
    #C  0.000  0.000  0.000  2.000  0.000  4.000  0.000
    #T  4.000  0.000  0.000  0.000  4.000  0.000  0.000
    #G  0.000  4.000  0.000  2.000  0.000  0.000  0.000
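
To show what the counts in that table mean, here is a sketch that rebuilds them with the standard library only (Python 3). It is not TAMO's implementation; count_matrix is a hypothetical helper that counts the bases found in each column of the alignment.

```python
from collections import Counter

msa = ['TGACTCA',
       'TGACTCA',
       'TGAGTCA',
       'TGAGTCA']

def count_matrix(sequences):
    """Return one Counter per alignment column (sequences of equal length)."""
    # zip(*sequences) walks through the alignment column by column
    return [Counter(column) for column in zip(*sequences)]

counts = count_matrix(msa)
for base in 'ACTG':
    # a Counter returns 0 for bases that never occur in a column
    print(base, [column[base] for column in counts])
```

Column 3 is the only one with two different bases (C and G, twice each), which is the position written as 's' in the motif TGAsTCA.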

Python – bioinformatics utilities

Scientific and statistics: scipy + numpy ☺☺☺☺☺

Plotting graphs: Matplotlib (pylab) ☺☺☺☺☺

SOAP / web scraping utilities: suds ☺☺☺

ORM modules, database handling, HDF5: Sqlalchemy + elixir, sqlobject, pytables ☺☺☺☺☺

Persistent data: cPickle, shelve, ZODB ☺☺☺☺ No R-like sessions

Python and Databases

There are some good libraries for Object Relational Mapping (ORM)

ZODB: Object Oriented Database

PyTables: hierarchical database (supports HDF5, a binary format used in astronomy/physics to store big data)

sqlalchemy example

Scientific Python

Numpy: Python module to work with arrays and matrices

Scipy: module to do advanced math, statistics, and more

Matplotlib: module to plot graphics

To get started with Python and plotting graphs:

    $: easy_install numpy scipy matplotlib ipython
    $: ipython -pylab

Numpy/Scipy example

Hint: use ipython -pylab to have an R-like environment

Is there anything I forgot?

?????

?????

?????

Thank you for the attention!

PRBB technical seminars: http://bg.imim.es/technical-seminars/

These slides will be uploaded on http://www.slideshare.net

Discarded slides

Hint: use ipython -pylab

The best way to work with python and plotting graphs is with ipython -pylab

It will give you a shell similar to matlab/octave/R/etc..

Regular expressions

To use regular expressions in python, you need to import the 're' module first

It's not as immediate as with Perl, where you can use regular expressions without importing anything

However, it is easier to get the documentation

Main python modules for bioinformatics

Biopython, Pygr

Python – storing/accessing data

Reading/Writing files: ☺☺☺☺☺

Persistent data: ☺☺☺ cPickle, shelve, ZODB

Database – Object Relational Mapping libraries: ☺☺☺☺☺ sqlalchemy, elixir

Binary formats (HDF5): ☺☺☺☺ pytables

R-like sessions: nothing that I know of :(
