Gautier bosc2010 pythonbioconductor

Bioconductor with Python, What else?

ISMB / BOSC

Laurent Gautier [laurent@cbs.dtu.dk]

DMAC / CBS

July 10th, 2010

1 / 20

Disclaimer

• This is not about the comparative merits of scripting languages
• This is about being able to access natively libraries implemented in a different language

2 / 20

About Bioconductor

• Set of open-source packages for R
• Started circa 2002 with a focus on microarrays
• Rooted in statistics, data analysis, and visualization
• Several hundred packages, addressing NGS, HTS, flow cytometry, protein-protein interactions, ...
• Biannual releases
• Presence on the publication circuit (> 2,300 citations for the BioC publication, > 600 for limma, > 500 for affy)

3 / 20

About Python

• Simple and clear all-purpose scripting language
• Sometimes used in introductions to programming
• Popular for agile development
• Bioinformatics libraries:
    • biopython (libraries for bioinformatics)
    • galaxy (web front-end to pipelines)
    • PyCogent, pygr, bx-python (biological sequences-oriented)
• Large selection of libraries:
    • Web development: Zope, Django, Google App Engine
    • Scientific computing: Scipy / Numpy
    • Cloud computing: Disco, execnet
    • Interface with C: ctypes, Cython

4 / 20

A view on R/bioconductor and Python in bioinformatics

[Figure: diagram with three groups of labels. Bioinformatics data: samples, microarray, NGS, annotation, flow-cytometry, proteomics, other assays, with automation and storage / retrieval. R/Bioconductor: statistical analysis, visualization, interactive programming. Python: non-interactive abilities, data storage / retrieval, web, algorithm development, scientific computing. Caption: "Python is an all-purpose scripting language." Communities shown: computer scientists, physicists, biologists, statisticians.]

5 / 20

Running R code from Python (an example)

Aim
Running edgeR from Python

Method
Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139-140

Data
Read counts per gene; the lanes are grouped as Control and Treated.

                 lane1  lane2  lane3  lane4  lane5  lane6  lane8
ENSG00000230758      0      0      1      0      0      0      0
ENSG00000182463      0      2      4      1      5      5      0
ENSG00000124208     82    124    102    136     90    120     40
ENSG00000230753      0      0      0      3      0      0      0
ENSG00000224628      7      8      8     18      8      7      1
ENSG00000125835    138    209    227    295    281    220     54
ENSG00000125834     25     31     48     56     67     61     15
ENSG00000197818     17     27     16     26     41     39      9
ENSG00000243473      0      0      0      2      0      0      0
ENSG00000226325      0      0      2      0      3      1      0
...                ...    ...    ...    ...    ...    ...    ...
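The library size passed to edgeR on the next slide is the per-lane column sum of this table (R's colSums). As a minimal sketch in plain Python, using only the ten rows shown above: the full table is elided on the slide, so these sums cover the excerpt only, and the names EXCERPT, parse_counts and column_sums are illustrative rather than part of any package.

```python
# Excerpt of the count table shown above: one row per gene, one column per lane.
EXCERPT = """\
gene lane1 lane2 lane3 lane4 lane5 lane6 lane8
ENSG00000230758 0 0 1 0 0 0 0
ENSG00000182463 0 2 4 1 5 5 0
ENSG00000124208 82 124 102 136 90 120 40
ENSG00000230753 0 0 0 3 0 0 0
ENSG00000224628 7 8 8 18 8 7 1
ENSG00000125835 138 209 227 295 281 220 54
ENSG00000125834 25 31 48 56 67 61 15
ENSG00000197818 17 27 16 26 41 39 9
ENSG00000243473 0 0 0 2 0 0 0
ENSG00000226325 0 0 2 0 3 1 0
"""


def parse_counts(text):
    """Return (lane names, {gene: [count per lane]}) from whitespace-delimited text."""
    lines = text.strip().splitlines()
    lanes = lines[0].split()[1:]
    counts = {}
    for line in lines[1:]:
        fields = line.split()
        counts[fields[0]] = [int(value) for value in fields[1:]]
    return lanes, counts


def column_sums(lanes, counts):
    """Plain-Python analogue of R's colSums(): total count per lane."""
    totals = [sum(column) for column in zip(*counts.values())]
    return dict(zip(lanes, totals))


lanes, counts = parse_counts(EXCERPT)
lib_size = column_sums(lanes, counts)
print(lib_size)
```

With rpy2 the same sum is computed on the R side, so the counts never need to be copied into Python lists.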

7 / 20

    from rpy2.robjects.packages import importr
    from bioc import edger

    base = importr('base')

    # counts (matrix of read counts) and grp (group of each lane)
    # are the data from the previous slide
    summarized = edger.DGEList.new(counts = counts,
                                   lib_size = base.colSums(counts),
                                   group = grp)

    disp = edger.estimateCommonDisp(summarized)

    tested = edger.exactTest(disp)

    results = edger.topTags(tested)

Results:

                 logConc  logFC  PValue   FDR
ENSG00000127954   -31.03  37.97    0.00  0.00
ENSG00000151503   -12.96   5.40    0.00  0.00
ENSG00000096060   -11.78   4.90    0.00  0.00
ENSG00000091879   -15.36   5.77    0.00  0.00
ENSG00000132437   -14.15  -5.90    0.00  0.00
ENSG00000166451   -12.62   4.57    0.00  0.00
ENSG00000131016   -14.80   5.27    0.00  0.00
ENSG00000163492   -17.28   7.30    0.00  0.00
ENSG00000113594   -12.25   4.05    0.00  0.00
ENSG00000116285   -13.02   4.11    0.00  0.00

8 / 20

R code / Python code

R:

    library(edgeR)
    summarized <- DGEList(counts = counts,
                          lib.size = colSums(counts),
                          group = grp)

    disp <- estimateCommonDisp(summarized)

Python:

    from rpy2.robjects.packages import importr
    base = importr('base')
    from bioc import edger

    summarized = edger.DGEList.new(counts = counts,
                                   lib_size = base.colSums(counts),
                                   group = grp)

    disp = edger.estimateCommonDisp(summarized)

Note:
• explicit in searching through namespaces
• call R functions as native Python functions
• use R objects as Python objects
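The R argument lib.size appears as lib_size on the Python side because a dot is not valid inside a Python identifier. A minimal sketch of that kind of name mapping, in plain Python: the helper names r_to_python_name, build_translation and translate_kwargs are illustrative, and this is not rpy2's own implementation.

```python
def r_to_python_name(r_name):
    """Map an R name such as 'lib.size' to a valid Python identifier."""
    return r_name.replace(".", "_")


def build_translation(r_arg_names):
    """Build a Python-name -> R-name table, refusing ambiguous mappings."""
    table = {}
    for r_name in r_arg_names:
        py_name = r_to_python_name(r_name)
        if py_name in table:
            raise ValueError(
                f"{table[py_name]!r} and {r_name!r} both map to {py_name!r}"
            )
        table[py_name] = r_name
    return table


def translate_kwargs(table, **kwargs):
    """Rename Python keyword arguments back to their R names."""
    return {table.get(name, name): value for name, value in kwargs.items()}


# Argument names taken from the DGEList call shown above.
table = build_translation(["counts", "lib.size", "group"])
r_kwargs = translate_kwargs(table, counts=[1, 2], lib_size=[3], group=["a"])
print(r_kwargs)
```

The ambiguity check matters because R allows both a.b and a_b as distinct names, and a one-way replacement would silently merge them.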

9 / 20

Bioconductor library IRanges

10 / 20

Bioconductor library Biostrings

11 / 20

Separate communities

12 / 20

Bilingual community

13 / 20

Interpreters/Translators

14 / 20

Cost of translation

R package       Lines of code   Python module
AnnotationDbi   168             annotationdbi.py
Biobase         341             biobase.py
Biostrings      591             biostrings.py
BSgenome        112             bsgenome.py
edgeR           107             edger.py
GEOquery        102             geoquery.py
GGbase          104             ggbase.py
GGtools          77             ggtools.py
goseq            43             goseq.py
GSEABase        149             gseabase.py
IRanges         295             iranges.py
ShortRead       301             shortread.py

15 / 20

R within Python

• R is running embedded in Python
• R objects remain in the R workspace, but can be accessed from Python
• Python-level shells to access the R objects
• The rpy2 package is used to achieve this

    biostrings = importr('Biostrings')

    class AAString(XString):

        _aastring_constructor = biostrings.AAString

        @classmethod
        def new(cls, x):
            """ :param x: a string of amino-acids """
            res = cls(cls._aastring_constructor(conversion.py2ri(x)))
            _setExtractDelegators(res)
            return res

    aas = AAString("PROTEIN")
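The "shell" idea can be sketched without R at all: the Python object holds only a handle, while the data stays in the other runtime's workspace. In this minimal sketch a dictionary plays the role of the R workspace; WORKSPACE, r_call and Shell are hypothetical names chosen for illustration, and none of this is rpy2 code.

```python
import itertools

WORKSPACE = {}                 # stand-in for the R workspace: handle -> object
_handles = itertools.count(1)  # source of fresh handles


def r_call(func, *handles):
    """Apply func to the objects behind the handles; keep the result workspace-side."""
    result = func(*(WORKSPACE[h] for h in handles))
    handle = next(_handles)
    WORKSPACE[handle] = result
    return handle


class Shell:
    """Python-level shell: stores a handle, not the data itself."""

    def __init__(self, handle):
        self.handle = handle

    @classmethod
    def new(cls, value):
        # Build the object on the workspace side, then wrap the returned handle.
        return cls(r_call(lambda: value))

    def __len__(self):
        return len(WORKSPACE[self.handle])

    def reverse(self):
        # The result is a new workspace object, wrapped in a new shell.
        return type(self)(r_call(lambda s: s[::-1], self.handle))

    def value(self):
        """Explicitly fetch the data from the workspace into Python."""
        return WORKSPACE[self.handle]


aas = Shell.new("PROTEIN")
rev = aas.reverse()
print(len(aas), rev.value())
```

Operations return new shells rather than Python values, so a chain of calls can stay on the workspace side until a result is explicitly fetched.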

16 / 20

What is needed to continue

More interpreters/translators
• Many Bioconductor packages remain to be translated.
• Existing translations must be kept up-to-date.

Keeping up-to-date
• Frequent API-breaking changes in Bioconductor
• Tailored interfaces increase maintenance
• Meta-programming and reflection can alleviate this

17 / 20

Example with meta-programming:

    import rpy2.robjects.methods

    class AssayData(rpy2.robjects.methods.RS4):
        """ Abstract class. That class is a ClassUnionRepresentation
        in R, that is a way to create a parent class for existing
        classes. This is currently not modelled in Python. """
        __rname__ = 'AssayData'
        __metaclass__ = rpy2.robjects.methods.RS4_Type

        __accessors__ = (('featureNames', 'Biobase', 'featurenames',
                          True, 'maps Biobase::featureNames'),
                         ('sampleNames', 'Biobase', 'samplenames',
                          True, 'maps Biobase::sampleNames'),
                         ('storageMode', 'Biobase', 'storagemode',
                          True, 'maps Biobase::storageMode'))
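The __accessors__ table lets a metaclass generate the Python attributes instead of each one being written by hand. A minimal sketch of the idea in plain Python 3, with a dictionary of Python functions standing in for the R packages. FAKE_R and AccessorType are hypothetical names, and the reading of each tuple as (R name, package, Python name, expose-as-property flag, docstring) is an assumption made for this illustration; this is not the rpy2 implementation.

```python
# Stand-in for functions living in R packages (hypothetical, for illustration).
FAKE_R = {
    ("Biobase", "featureNames"): lambda obj: obj["features"],
    ("Biobase", "sampleNames"): lambda obj: obj["samples"],
}


class AccessorType(type):
    """Metaclass turning each __accessors__ entry into a method or a property."""

    def __new__(mcs, name, bases, namespace):
        entries = namespace.get("__accessors__", ())
        for r_name, package, py_name, as_property, doc in entries:
            func = FAKE_R[(package, r_name)]

            # Bind func through a default argument so that each generated
            # accessor keeps its own function (avoids the late-binding trap).
            def accessor(self, _func=func):
                return _func(self.robj)

            accessor.__doc__ = doc
            if as_property:
                namespace[py_name] = property(accessor, doc=doc)
            else:
                namespace[py_name] = accessor
        return super().__new__(mcs, name, bases, namespace)


class AssayData(metaclass=AccessorType):
    __accessors__ = (
        ("featureNames", "Biobase", "featurenames", True,
         "maps Biobase::featureNames"),
        ("sampleNames", "Biobase", "samplenames", False,
         "maps Biobase::sampleNames"),
    )

    def __init__(self, robj):
        self.robj = robj  # the wrapped "R object" (a plain dict in this sketch)


data = AssayData({"features": ["g1", "g2"], "samples": ["s1"]})
print(data.featurenames)   # generated property
print(data.samplenames())  # generated method
```

When an upstream function is renamed, only the table entry changes, which is the maintenance saving the previous slide points to.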

18 / 20

Example of a complete application

A web-server to run edgeR.

    from bottle import route, run, request, abort
    # read_count_data is assumed to be provided by my_edger with the other helpers
    from my_edger import get_toptags, make_results_page, read_count_data

    @route('/')
    def index():
        return '''<html> <body>
    <form action="/edger" method="post" enctype="multipart/form-data">
    <input type="file" name="data" />
    <input type="submit" />
    </form>
    </body> </html>'''

    @route('/edger', method='POST')
    def run_edger():
        data = request.files.get('data')
        if data:
            counts, grp = read_count_data(data.file.name)
            top_tags = get_toptags(counts, grp)
            return make_results_page(top_tags)
        else:
            abort(404, "Invalid count file.")

    run(host='localhost', port=8080)

19 / 20

Acknowledgements
• Users, and communities from R, Bioconductor, Python, Biopython
• (Vincent Davis, Nicolas Rapin, Brad Chapman)

URLs
http://pypi.python.org/pypi/rpy2-bioconductor-extensions/
http://bitbucket.org/lgautier/rpy2-bioc-extensions
http://packages.python.org/rpy2-bioconductor-extensions/
http://rpy2.sourceforge.net/

20 / 20
