Cheminformatics in R

Cheminformaticsin R

1/189

Cheminformatics in R

or How to Start R and Never Have to Exit

Rajarshi Guha

NIH Chemical Genomics Center

17th May, 2010EBI, Hinxton

Cheminformaticsin R

2/189

Background

Cheminformaticsin R

3/189

Background

I Involved in various aspects ofcheminformatics since 2002

I QSAR modeling, virtual screening,polypharmacology, networks

I Algorithm developementI Cheminformatics software

development

I Core developer for the CDK

Cheminformaticsin R

4/189

Background

I Pretty much live inside of R

I Been using it since 2003, developed a number of Rpackages, mostly public

I Make extensive use of R at NCGC for small molecule &RNAi screening

Cheminformaticsin R

5/189

Acknowledgements

I rcdkI Miguel RojasI Ranke Johannes

I CDKI Egon WillighagenI Christoph SteinbeckI . . .

Cheminformaticsin R

6/189

Resources

I Download binariesI If you’re working on Unix it’s a good idea to compile

from source

I There’s lots of documentation of varying qualityI The R manual is a good document to start withI R Data Import and Export is extremely handy regarding

data loadingI A variety of short introductory tutorials as well as

documents on specific topics in statistical modeling canbe found here

I More general help can be obtained fromI The R-help mailing list. Has a lot of renoowned experts

on the list and if you can learn quite a bit of statistics inaddition to getting R help just be reading the archives ofthe list. Note: they do expect that you have done yourhomework! See the posting guide

I A useful reference for Python/Matlab users learning R

http://lib.stat.cmu.edu/R/CRAN/bin/

http://lib.stat.cmu.edu/R/CRAN/sources.html

http://lib.stat.cmu.edu/R/CRAN/doc/manuals/R-intro.html

http://lib.stat.cmu.edu/R/CRAN/doc/manuals/R-data.html

http://lib.stat.cmu.edu/R/CRAN/other-docs.html

http://www.r-project.org/mail.html

http://www.r-project.org/posting-guide.html

http://www.scribd.com/doc/7214205/Matlab-Python-Xref

Cheminformaticsin R

7/189

R Books

I A number of books are available ranging fromintroductions to R along with examples up to advancedstatistical modeling using R.

I Data Analysis and Graphics Using R: An Example-basedApproach by John Maindonald, John Braun

I Introductory Statistics with R by Peter Dalgaard

I Using R for Introductory Statistics by John Verzani

I If you end up programming a lot in R, you’ll want tohave Programming with Data: A Guide to the SLanguage by John M. Chambers

http://www.amazon.com/Data-Analysis-Graphics-Using-Example-based/dp/0521861160/ref=sr_1_1?ie=UTF8&s=books&qid=1202837378&sr=1-1

http://www.amazon.com/Data-Analysis-Graphics-Using-Example-based/dp/0521861160/ref=sr_1_1?ie=UTF8&s=books&qid=1202837378&sr=1-1

http://www.amazon.com/Introductory-Statistics-R-Peter-Dalgaard/dp/0387954759/ref=pd_bbs_sr_1?ie=UTF8&s=books&qid=1202837410&sr=1-1

http://www.amazon.com/Using-Introductory-Statistics-John-Verzani/dp/1584884509/ref=sr_1_1?ie=UTF8&s=books&qid=1202837456&sr=1-1

http://www.amazon.com/Programming-Data-Guide-S-Language/dp/0387985034/ref=pd_bbs_sr_2?ie=UTF8&s=books&qid=1202837496&sr=1-2

http://www.amazon.com/Programming-Data-Guide-S-Language/dp/0387985034/ref=pd_bbs_sr_2?ie=UTF8&s=books&qid=1202837496&sr=1-2

Cheminformaticsin R

8/189

IDE’s and editors

I Emacs + ESS - works on Linux, Windows, OS X.Invaluable if you are an Emacs user

I TINN-R is a small and useful editor for WindowsI You can also do R in Eclipse

I StatETI Rsubmit

I A number of different GUI’s are available for, which maybe useful if you’re an infrequent user

I RcommanderI JGRI RattleI PMGI SciViews-RI Red-R

http://ess.r-project.org/

https://sourceforge.net/projects/tinn-r

http://www.walware.de/goto/statet

http://www.alysis.de/Rsubmit/

http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/

http://stats.math.uni-augsburg.de/JGR/

http://rattle.togaware.com/

http://wiener.math.csi.cuny.edu/pmg

http://www.sciviews.org/SciViews-R

http://www.red-r.org/

Cheminformaticsin R

9/189

What is R?

I R is an environment for modelingI Contains many prepackaged statistical and mathematical

functionsI No need to implement anything

I R is a matrix programming language that is good forstatistical computing

I Full fledged, interpreted languageI Well integrated with statistical functionalityI More details later

Cheminformaticsin R

10/189

What is R?

I It is possible to use R just for modelingI Avoids programming, preferably use a GUI

I Load data → build model → plot data

I But you can also get much more creativeI Scripts to process multiple data filesI Ensemble modeling using different types of modelsI Implement brand new algorithms

I R is good for prototyping algorithmsI Interpreted, so immediate resultsI Good support for vectorization

I Faster than explicit loopsI Analogous to map in Python and Lisp

I Most times, interpreted R is fine, but you can easilyintegrate C code

Cheminformaticsin R

11/189

What is R?

I R integrates with other languagesI C code can be linked to R and C can also call R functionsI Java code can be called from R and vice versa. See

various packages at rosuda.orgI Python can be used in R and vice versa using Rpy

I R has excellent support for publication quality graphics

I See R Graph Gallery for an idea of the graphingcapabilities

I But graphing in R does have a learning curveI A variety of graphs can be generated

I 2D plots - scatter, bar, pie, box, violin, parallel coordinateI 3D plots - OpenGL support is available

http://rosuda.org/software/

http://rpy.sourceforge.net/

http://addictedtor.free.fr/graphiques/

Cheminformaticsin R

12/189

Why Cheminformatics in R?

I In contrast to bioinformatics (cf. Bioconductor), not awhole lot of cheminformatics support for R

I For cheminformatics and chemistry relevant packagesinclude

I rcdk, rpubchem, fingerprintI bio3d, ChemmineR

I A lot of cheminformatics employs various forms ofstatistics and machine learning - R is exactly theenvironment for that

I We just need to add some chemistry capabilities to it

http://mccammon.ucsd.edu/~bgrant/bio3d/

http://bioweb.ucr.edu/ChemMineV2/chemminer/

Cheminformaticsin R

13/189

Outline

I An overview of the R language

I An overview of the CDK library

I Exploring the rcdk package

I Exploring the rcdklibs package

I Extending the code

Cheminformaticsin R

14/189

Accessing packages

I From within R (you’ll need 2.11.0 or better)

install.packages(c("rcdk","rpubchem"), dependencies = TRUE)

I Sources for rcdk and rcdklibs can be obtained fromthe Github repository

I This is required if you want to modify Java codeassociated with rcdk

I Installable development packages (OS X, Linux,Windows) available from http://rguha.net/rcdk

I These will make their way to CRAN

http://github.com/rajarshi/cdkr/

http://rguha.net/rcdk

Cheminformaticsin R

15/189

Workshop prerequisites

I Everything should be installed

I Running the code below should not give any errors

library(rcdk)

library(rpubchem)

source(helper.R)

I Most of the code snippets in these slides are contained incode.R

I You should have a data/ directory containing variousdatasets used in this workshop

I You should have a exercises/ directory containing thesource code solutions to the exercises

Cheminformaticsin R

16/189

The language

Parallel R

Database access

Part I

Overview of R

Cheminformaticsin R

17/189

The language

Parallel R

Database access

Outline

The language

Parallel R

Database access

Cheminformaticsin R

18/189

The language

Parallel R

Database access

Working in R

I To get help for a function name do: ?functionname

I If you don’t know the name of the function, use:help.search("blah")

I R Site Search and Rseek are very helpful online resources

I To exit from the prompt do: quit()

I Two ways to run R codeI Type or paste it at the promptI Execute code from a source file: source("mycode.R")

I Saving your workI save.image(file="mywork.Rda") saves the whole

workspace to mywork.Rda, which is a binary fileI When you restart R, do: load("mywork.Rda") will restore

your workspaceI You can save individual (or multiple) objects using save

http://search.r-project.org/

http://rseek.org/

Cheminformaticsin R

19/189

The language

Parallel R

Database access

Basic types in R

Primitive types

I numeric - indicates an integer or floating point number

I character - an alphanumeric symbol (string)

I logical - boolean value. Possible values are TRUE andFALSE

Complex types

I vector - a 1D collection of objects of the same type

I list - a 1D container that can contain arbitrary objects.Each element can have a name associated with it

I matrix - a 2D data structure containing objects of thesame type

I data.frame - a generalization of the matrix type. Eachcolumn can be of a different type

Cheminformaticsin R

20/189

The language

Parallel R

Database access

Primitive types

> x <- 1.2

> x <- 2

> y <- "abcdegf"

> y

[1] "abcdegf"

> z <- c(1,2,3,4,5)

> z

[1] 1 2 3 4 5

> z <- 1:5

> z

[1] 1 2 3 4 5

Cheminformaticsin R

21/189

The language

Parallel R

Database access

Matrices

I Make a matrix from a vector

I The matrix is constructed column-wise

> z <- c(1,2,3,4,5,6,7,8,9,10)

> m <- matrix(z, nrow=2, ncol=5)

> m

[,1] [,2] [,3] [,4] [,5]

[1,] 1 3 5 7 9

[2,] 2 4 6 8 10

Cheminformaticsin R

22/189

The language

Parallel R

Database access

data.frame

I We want to store compound names and number of atomsand whether they are toxic or not

I Need to store: characters, numbers and boolean values

x <- c("aspirin", "potassium cyanide",

"penicillin", "sodium hydroxide")

y <- c(21, 3, 41,3)

z <- c(FALSE, TRUE, FALSE, TRUE)

d <- data.frame(x,y,z)

> d

x y z

1 aspirin 21 FALSE

2 potassium cyanide 3 TRUE

3 penicillin 41 FALSE

4 sodium hydroxide 3 TRUE

Cheminformaticsin R

23/189

The language

Parallel R

Database access

data.frame

I But do the column means? 6 months later, we probablywon’t know (easily)

I So we should add names

> names(d) <- c("Name", "NumAtom", "Toxic")

> d

Name NumAtom Toxic

1 aspirin 21 FALSE

2 potassium cyanide 3 TRUE

3 penicillin 41 FALSE

4 sodium hydroxide 3 TRUE

I When adding column names, don’t use the underscorecharacter

I You can access a column of a data.frame by it’s name:d$Toxic

Cheminformaticsin R

24/189

The language

Parallel R

Database access

Lists

I A 1D collection of arbitrary objects (ArrayList in Java)

I We can access the lists by indexing: mylist[1] ormylist[[1]] for the first element or mylist[c(1,2,3)] forthe first 3 elements

x <- 1.0

y <- "hello there"

s <- c(1,2,3)

mylist <- list(x,y,s)

> mylist

[[1]]

[1] 1

[[2]]

[1] "hello there"

[[3]]

[1] 1 2 3

Cheminformaticsin R

25/189

The language

Parallel R

Database access

Lists

I It’s useful to name individual elements of a list

x <- "oxygen"

z <- 8

m <- 32

mylist <- list(name="oxygen",

atomicNumber=z,

molWeight=m)

> mylist

$name

[1] "oxygen"

$atomicNumber

[1] 8

$molWeight

[1] 32

Cheminformaticsin R

26/189

The language

Parallel R

Database access

Lists

I We can then get the molecular weight by writing

> mylist$molWeight

[1] 32

Cheminformaticsin R

27/189

The language

Parallel R

Database access

Identifying types

I It’s useful initially, to identify the type of the object

I Usually just printing it out explains what it is

I You can be more concise by doing

> x <- 1

> class(x)

[1] "numeric"

> x <- "hello world"

> class(x)

[1] "character"

> x <- matrix(c(1,2,3,4), nrow=2)

> class(x)

[1] "matrix"

Cheminformaticsin R

28/189

The language

Parallel R

Database access

Indexing

I Indexing fundamental to using R and is applicable tovectors, matrices, data.frame’s and lists

I all indices start from 1 (not 0!)

I For a 1D vector, we get the i ’th element by x[i]

I For a 2D matrix of data.frame we get the i , j element byx[i,j]

I But we can also index using vectorsI Called an index vectorI Allows us to select or exclude multiple elements

simultaneously

Cheminformaticsin R

29/189

The language

Parallel R

Database access

Indexing

I Say we have a vector, x and a matrix y

> x

[1] 1 2 3 4 5 6 7 8 9 10

> y

[,1] [,2] [,3] [,4] [,5]

[1,] 1 4 7 10 13

[2,] 2 5 8 11 14

[3,] 3 6 9 12 15

I Some ways to get elements from x

I 5th element → x[5]

I The 3rd, 4th and 5th elements → x[ c(3,4,5) ]

I Everything but the 3rd, 4th and 5th elements →x[ -c(3,4,5) ]

Cheminformaticsin R

30/189

The language

Parallel R

Database access

Indexing

I Say we have a matrix y

> y

[,1] [,2] [,3] [,4] [,5]

[1,] 1 4 7 10 13

[2,] 2 5 8 11 14

[3,] 3 6 9 12 15

I Some ways to get elements from y

I Element in the first row, second column → y[1,2]

I All elements in the 1st column → y[,1]

I All elements in the 3rd row → y[3,]

I Elements in the first 2 columns → y[ , c(1,2) ]

I Elements not in the first 2 columns → y[ , -c(1,2) ]

Cheminformaticsin R

31/189

The language

Parallel R

Database access

Indexing

I You can also use a boolean vector to index another vector

I In this case, the index vector must of the same length asyour target vector

I If the i ’th element in the index vector is TRUE then thei ’th element of the target is selected

> x <- c(1,2,3,4)

> idx <- c(TRUE, FALSE, FALSE, TRUE)

> x[ idx ]

[1] 1 4

I It’s a pain to have to create a whole index vector by hand

Cheminformaticsin R

32/189

The language

Parallel R

Database access

Indexing

I A more intuitive way to do this, is to make use of R’svector operations

I x == 2 will compare each element of x with the value 2and return TRUE or FALSE

I The result is a boolean vector

> x <- c(1,2,3,4)

> x == 2

[1] FALSE TRUE FALSE FALSE

I So we can select the elements of x that are equal to 2 bydoing

> x <- c(1,2,3,4)

> x[ x == 2 ]

[1] 2

Cheminformaticsin R

33/189

The language

Parallel R

Database access

Indexing

I Or select elements of x that are greater than 2

> x <- c(1,2,3,4)

> x[ x > 2 ]

[1] 3 4

I Or select elements of x that are 2 or 4

> x <- c(1,2,3,4)

> x[ x == 2 | x == 4 ]

[1] 2 4

I The single bar (|) is intentional

I Logical operations using vectors should use |, & ratherthan the usual ||, &&

Cheminformaticsin R

34/189

The language

Parallel R

Database access

List indexing

I With list objects we can index using [ or [[I [ keeps element names, [[ drops names

I Can’t use index vectors or boolean vectors with [[

> y <- list(a=’b’, b=123, x=c(1,2,3))

> y[3]

$x

[1] 1 2 3

> y[[3]]

[1] 1 2 3

> y[[1:2]]

Error in y[[1:2]] : subscript out of bounds

Cheminformaticsin R

35/189

The language

Parallel R

Database access

Subsetting

I Get a subset of the rows of a matrix or data.frame basedon values in a certain column

I We can generate a boolean index vector

I For data.frame’s, we can use a more intuitive approach

d <- data.frame(a=1:10, b=runif(10), c=rnorm(10))

d[d$a <= 6,]

subset(d, a <= 6)

subset(d, a >= 7 & b < 0.4)

Cheminformaticsin R

36/189

The language

Parallel R

Database access

Merging data.frames

I Lets you join data.frames based on a common column

I Pretty much identical to an SQL join (and supportsnatural and outer joins)

I Very handy when merging data from multiple sources

d1 <- data.frame(mol=I(c("mol1", "mol2", "mol3")),

klass=c("active", "active", "inactive"))

d2 <- data.frame(molecule=I(c("mol1", "mol2", "mol3")),

TPSA=c(1.2,3.4,5.6), Kier1=c(5.7, 8.9, 12))

> merge(d1, d2, by.x="mol", by.y="molecule")

mol klass TPSA Kier1

1 mol1 active 1.2 5.7

2 mol2 active 3.4 8.9

3 mol3 inactive 5.6 12.0

Cheminformaticsin R

37/189

The language

Parallel R

Database access

Functions in R

I Functions are much like in any other languageI R employs copy-on-write semantics

I Send in a copy of an input variable if it will be modifiedin the function

I Otherwise send in a reference

I Typing is dynamic

my.add <- function(x,y) {

return(x+y)

}

I You can then call the function as my.add(1,2)

I In general, check for the type of object using class()

I Functions can be anonymous

Cheminformaticsin R

38/189

The language

Parallel R

Database access

Loop constructs

I R does not have C-style looping

for (i = 0; i < 10; i++) {

print(i)

}

I Rather, it works like Python’s for-in idiom

for (i in c(0,2,3,4,5,6,7,8,9)) {

print(i)

}

Cheminformaticsin R

39/189

The language

Parallel R

Database access

Loop constructs

I In general, to loop n times, you create a vector with nelements, say by writing 1 : n

I But like Python loops, you can just loop over elements ofa vector or list and use them

> objects <- list(x=1, y="hello world", z=c(1,2,3))

> for (i in objects) print(i)

[1] 1

[1] "hello world"

[1] 1 2 3

I Good R style discourages explicit loops

Cheminformaticsin R

40/189

The language

Parallel R

Database access

Apply and friends

I Explicit loops (for (...)) are not idiomatic and can beslow

I apply, lapply, sapply resemble functional programmingstyles (but see Reduce, Filter and Map for alternatives)

I Good idea to master these forms

x <- c(1,2,3,4,5)

sapply(x, sqrt)

sapply(x, function(z) z+3)

sapply(x, function(z) rep(z,3), simplify=FALSE)

sapply(x, function(z) rep(z,3), simplify=TRUE)

I Handy for looping over vector elements

http://stat.ethz.ch/R-manual/R-patched/library/base/html/funprog.html

http://stat.ethz.ch/R-manual/R-patched/library/base/html/funprog.html

Cheminformaticsin R

41/189

The language

Parallel R

Database access

Looping over lists

I For list objects, use lapply

I The return value is a list - if it can be converted to asimple vector use unlist to do so

x <- list(x=1, b=2, c=3)

lapply(x, sqrt)

unlist(lapply(x, sqrt))

lapply(x, function(z) {

return(z^2)

})

I The function argument of lapply should expect an objectas the first argument

I Columns of a data.frame are lists, so lapply can beused to loop over columns

Cheminformaticsin R

42/189

The language

Parallel R

Database access

Looping over multiple objects

I What if you want to loop over two (or more) lists (orvectors) simulataneously

I Like Pythons zip would allow

I Use mapply and provide a function that takes (at least) asmany arguments as input vectors/lists

x <- 1:3

y <- list(a="Hello", b=12.34, c=c(1,2,3))

mapply(function(a,b) {

print(a)

print(b)

print("---")

}, x, y)

Cheminformaticsin R

43/189

The language

Parallel R

Database access

Looping over matrices

I Use apply to operate on entire rows or columns of amatrix or data.frame

I Be careful when working on rows of a data.frame -behavior can be unexpected if all columns are not of thesame type

m <- matrix(1:12, nrow=4)

apply(m, 1, sum) ## sum the rows

apply(m, 2, sum) ## sum the columns

I The function argument of apply should expect a vector asthe first argument

Cheminformaticsin R

44/189

The language

Parallel R

Database access

Looping caveats

I The main thing to remember when using apply, lapply

etc is nature of inputs to the user supplied function

I apply will send a vector (corresponding to a row orcolumn)

I lapply and sapply will send a single element of thelist/vector

I mapply will send a single element of each of the inputlists/vectors

I The function argument of these methods can takeanonymous functions or pre-defined functions

Cheminformaticsin R

45/189

The language

Parallel R

Database access

Loading data

I Writing vectors and matrices by hand is good for 15minutes!

I Easier to read data from files

I R supports a variety of text and binary formatsI Certain packages can be loaded to handle special formats

I Read from SPSS, Stata, SASI Read microarray data, GIS data, chemical structure data

I Excel files can also be read (easiest on Windows)

I Data can be accessed from SQL databases (Postgres,MySQL, Oracle)

Cheminformaticsin R

46/189

The language

Parallel R

Database access

Reading CSV data

I Consider the extract of a CSV file below

ID,Class,Desc1,Desc2,Desc3,Desc4,Desc5

w150,nontoxic,0.37,0.44,0.86,0.44,0.77

c49,toxic,0.37,0.58,0.05,0.85,0.57

s97,toxic,0.66,0.63,0.93,0.69,0.08

I Key features includeI A header line, naming the columnsI An ID column and a Class column, both characters

I Reading this data file is as simple as

> dat <- read.csv("data1.csv", header=TRUE)

I If you print dat it will scroll down your screen

Cheminformaticsin R

47/189

The language

Parallel R

Database access

Reading CSV data

I To get a quick summary of the data.frame use the str()

function

> str(dat)

’data.frame’: 100 obs. of 7 variables:

$ ID : Factor w/ 100 levels "a115","a180",..: 83 7 66 85 64 82 33 10 92 95 ...

$ Class: Factor w/ 2 levels "nontoxic","toxic": 1 2 2 1 2 1 1 1 1 2 ...

$ Desc1: num 0.37 0.37 0.66 0.18 0.88 0.37 0.72 0.72 0.2 0.68 ...

$ Desc2: num 0.44 0.58 0.63 0.09 0.1 0.14 0.03 0.15 0.65 0.66 ...

$ Desc3: num 0.86 0.05 0.93 0.57 0.85 0.67 0.99 0.68 0.77 0.78 ...

$ Desc4: num 0.44 0.85 0.69 0.85 0.58 0.17 0.95 0.19 0.69 0.61 ...

$ Desc5: num 0.77 0.57 0.08 0.17 0.74 0.56 0.22 0.19 0.91 0.13 ...

Cheminformaticsin R

48/189

The language

Parallel R

Database access

What is a factor?

I The factor type is used to represent categorical variablesI Toxic, non-toxicI Active, inactiveI Yes, no, maybe

I They correspond to enum’s in C/Java

I A factor variable looks like an ordinary vector ofcharacters

I Internally they are represented by integersI A factor variable has a property called the levels

I Indicates what values are valid for the factor

> levels(dat$Class)

[1] "nontoxic" "toxic"

Cheminformaticsin R

49/189

The language

Parallel R

Database access

What is a factor?

I If you try to assign an invalid value to a factor you get awarning

> dat$Class[1] <- "Yes"

Warning message:

In ‘[<-.factor‘(‘*tmp*‘, 1, value = "Yes") :

invalid factor level, NAs generated

> dat$Class[1:10]

[1] <NA> toxic toxic nontoxic toxic nontoxic nontoxic nontoxic

[9] nontoxic toxic

Levels: nontoxic toxic

Cheminformaticsin R

50/189

The language

Parallel R

Database access

Operating on groups

I Factors are very useful when you’d like to operate onsubsets of the data, as defined by factor levels

I Assay data might label molecules as active, inactive,inconclusive

I by is similar to lapply but operates on data.frame’s in afactor-wise way

I Takes a function argument which will receive a subset ofthe input data.frame, corresponding to a single level ofthe factor

Cheminformaticsin R

51/189

The language

Parallel R

Database access

assay <- read.csv("data/aid2170.csv", header=TRUE)

str(assay)

levels(assay$PUBCHEM.ACTIVITY.OUTCOME)

## how many actives and inactives are there?

by(assay, assay$PUBCHEM.ACTIVITY.OUTCOME, function(x) {

nrow(x)

})

## summary stats of the inhibiion values, by group


summary(x$Inhibition)

})

## or distributions of the inhibition values, by group

par(mfrow=c(1,2))


hist(x$Inhibition)

})

Cheminformaticsin R

52/189

The language

Parallel R

Database access

Outline

The language

Parallel R

Database access

Cheminformaticsin R

53/189

The language

Parallel R

Database access

Parallel R

I R itself is not multi-threadedI Well suited for embarassingly parallel problems

I Even then, a number of “large data” problems are nottractable

I Recent developments on integrating R and Hadoopaddress this

I See the RHIPE package

I We’ll look at snow which allows distribution ofprocessing on the same machine (multiple CPU’s) ormultiple machines

I But see snowfall for a nice set of wrappers around snow

I Also see multicore for a package that focuses onparallel processing on multicore CPU’s

http://www.stat.purdue.edu/~sguha/rhipe/

http://cran.r-project.org/web/packages/snow/index.html

http://cran.r-project.org/web/packages/snowfall/index.html

http://cran.r-project.org/web/packages/multicore/index.html

Cheminformaticsin R

54/189

The language

Parallel R

Database access

Trivial parallelization with snow

I snow lets you use multiple machines or multiple cores ona single machine

I Well suited for data-parallel problems

I If using multiple machines, data will be sent across thenetwork, so large chunks of data can slow things down

I General strategy isI Initialize clusterI Send variables required for calculation to the nodesI Call a parallel function

Cheminformaticsin R

55/189

The language

Parallel R

Database access

Use case - exhaustive feature selection

I Given a set of N descriptors, find a n-descriptor subsetthat leads to a good predictive model

I Will obviously explode - 50C5 = 2, 118, 760 descriptorcombinations

I Usually we would employ a genetic algorithm or simulatedannealing or even some form of stepwise selection

I For pedagogical purposesI how can we do this exhaustively?I how can we speed it up?

Cheminformaticsin R

56/189

The language

Parallel R

Database access

The data and a model

I Lets consider some melting point data

I Precalculated and reduced descriptor set

I 184 molecules in training set, 34 descriptors

I Goal - find a good 4 descriptor OLS model

0Bergstrom, C.A.S. et al, J. Chem. Inf. Model., 2003, 43, 1177–1185

Cheminformaticsin R

57/189

The language

Parallel R

Database access

A serial solution

I Evaluate all 4-member combinations of the numbers{1, 2, . . . , 34}

I These are the indices of the descriptor columns

I Loop over the rows, use the values as column indices ofthe descriptor matrix, build a model

## First some data prep

library(gtools)

load("data/bergstrom/bergstrom.Rda")

depv <- train.desc[,1]

desc <- train.desc[,-1]

idx <- 1:nrow(desc)

combos <- combinations(ncol(desc), 4)

Cheminformaticsin R

58/189

The language

Parallel R

Database access

A serial solution

rmse <- function(x,y) sqrt(sum((x-y)^2)/length(x))

system.time(

rmse.serial <- apply(combos, 1, function(idx, mydata, y) {

dat <- data.frame(y=y, x=mydata[,idx])

m <- lm(y~., dat)

return(rmse(m$fitted,y))

}, mydata=desc, y=depv)

)

I Takes about 200 seconds

Cheminformaticsin R

59/189

The language

Parallel R

Database access

The parallel solution

I Code is pretty much identical

I First need to initialize the “cluster” - in this case we’llrun multiple processes on the same machine

library(snow)

clus <- makeCluster(rep("localhost",2), type="SOCK")

system.time(

rmse.par <- parApply(clus, combos, 1, function(idx, mydata, y) {

dat <- data.frame(y=y, x=mydata[,idx])

m <- lm(y~., dat)

return(rmse(m$fitted,y))

}, mydata=desc, y=depv)

)

I Takes about 120 seconds, but easily scales up withmultiple cores

Cheminformaticsin R

60/189

The language

Parallel R

Database access

Outline

The language

Parallel R

Database access

Cheminformaticsin R

61/189

The language

Parallel R

Database access

R and databases

I Bindings to a variety of databases are availableI Mainly RDBMS’s but some NoSQL databases are being

interfaced

I The R DBI spec lets you write code that is portable overdatabases

I Note that loading multiple database packages can lead toproblems

I This can happen even when you don’t explicitly load adatabase package

I Some Bioconductor packages load the RSQLite packageas a dependency, which can interfere with, say, ROracle

http://cran.r-project.org/web/packages/DBI/index.html

http://cran.r-project.org/web/packages/RSQLite/index.html

http://cran.r-project.org/web/packages/ROracle/index.html

Cheminformaticsin R

62/189

The language

Parallel R

Database access

A quick look at RSQLite

I Not suitable for very large data or multiple connections

I Definitely look at sqldf, that makes it very easy to useSQL on R data.frames

I Very brief overview of what we can do with RSQLite forlocal storage

I Should transer easily to other databases

http://code.google.com/p/sqldf/

Cheminformaticsin R

63/189

The language

Parallel R

Database access

Basic usage - connecting

I We’ll use a database I prepared and placed atdata/db/assay.db

I Lets first get some information about the tables and fielddefinitions

library(RSQLite)

## specific to RSQLite

sqlite <- dbDriver("SQLite")

con <- dbConnect(sqlite, dbname = "data/db/assay.db")

> dbListTables(con)

[1] "assay"

> dbListFields(con, "assay")

[1] "row_names" "aid" "PUBCHEM_SID" "PUBCHEM_CID"

"PUBCHEM_ACTIVITY_OUTCOME" "PUBCHEM_ACTIVITY_SCORE"

"PUBCHEM_ASSAYDATA_COMMENT"

Cheminformaticsin R

64/189

The language

Parallel R

Database access

Basic usage - querying

I The simplest way to get data is to slurp in the wholetable as a data.frame

I Not a good idea if the table is very large

assay <- dbReadTable(con, "assay")

> str(assay)


$ aid : num 396 396 396 429 429 429 429 429 429 429 ...

$ PUBCHEM_SID : int 10318951 10318962 10319004 842121 842122 842123 842124 842125 842126 842127 ...

$ PUBCHEM_CID : int 6419748 5280343 969516 6603008 6602571 6602616 644371 6603132 2850911 6603374 ...

$ PUBCHEM_ACTIVITY_OUTCOME : chr "active" "active" "active" "inactive" ...

$ PUBCHEM_ACTIVITY_SCORE : int 100 100 100 0 1 1 1 -3 0 0 ...

$ PUBCHEM_ASSAYDATA_COMMENT: int 20061024 20060421 20061024 20061127 20061127 20061127 20061127 20061127 20060622 20061127 ...

Cheminformaticsin R

65/189

The language

Parallel R

Database access

Basic usage - querying

I Better to get subsets of the table via SQL queries

I Two ways to do it - get back all rows from the query orretrieve in chunks

I The latter is better when you might get a lot of rows

## query and get results in one go

dbGetQuery(con, "select distinct aid from assay")

## query and then get results in chunks

res <- dbSendQuery(con, "select * from assay where aid = 429")

fetch(res, n=10) ## get first 10 results

fetch(res, n=10) ## get next 10

fetch(res, n=-1) ## get remaining results

## no more rows to fetch

fetch(res, n=-1)

dbClearResult(res) ## a vital step!

Cheminformaticsin R

66/189

The language

Parallel R

Database access

Saving data in a database

I Can directly write a data.frame to a database file

I RSQLite will automatically generate a CREATE TABLEstatement

library(rpubchem)

## get some assay data from PubChem

assay <- sapply(c(396, 429, 438, 445), function(x) {

data.frame(aid=x, get.assay(x))[,1:6]

}, simplify=FALSE)

## join all assays into a single data.frame

assay <- do.call("rbind", assay)

## Make a new table

sqlite <- dbDriver("SQLite")

con <- dbConnect(sqlite, dbname = "assay.db")

dbWriteTable(con, "assay", assay)

dbDisconnect(con)

Cheminformaticsin R

67/189

The language

Parallel R

Database access

So why use a database?

I Dont have to load bulk CSV or .Rda files each time westart work

I Can index data in RDBMS’s so queries can be very fast

I Good way to exchange data between applications (asopposed to .Rda files which are only useful between Rusers)

I All the code above, except for the dbConnect statementwill be identical for PostgreSQL, MySQL etc.

Cheminformaticsin R

68/189

Introduction

I/O

Fingerprints

SubstructuresPart II

The Chemistry Development Kit

Cheminformaticsin R

69/189

Introduction

I/O

Fingerprints

Substructures

Outline

Introduction

I/O

Fingerprints

Substructures

Cheminformaticsin R

70/189

Introduction

I/O

Fingerprints

Substructures

Resources

I The CDK wiki

I The nightly build page

I Code snippets

I cdk-user mailing list

I Keyword list and feature listI If you want to develop with the CDK, use the

comprehensive jar file that contains all dependenciesI from the nightly build page for the cutting edgeI or from Sourceforge for the last stable release

http://apps.sourceforge.net/mediawiki/cdk/index.php?title=Main_Page

http://pele.farmbio.uu.se/nightly/

http://rguha.net/code/java/

http://pele.farmbio.uu.se/nightly/keywords.html

http://sourceforge.net/apps/mediawiki/cdk/index.php?title=Features

Cheminformaticsin R

71/189

Introduction

I/O

Fingerprints

Substructures

Some CDK issues

I Learning the CDK is non-trivial

I No real tutorial but see here for a start

I Best place is to use the keyword list to find the relevantclass and then look at the unit tests for a class

http://cdk.sourceforge.net/guides/user/pt01.html

http://pele.farmbio.uu.se/nightly/keywords.html

Cheminformaticsin R

72/189

Introduction

I/O

Fingerprints

Substructures

What does the CDK provide?

I Fundamental chemical objectsI atomsI bondsI molecules

I More complex objects are also availableI SequencesI ReactionsI Collections of molecules

I Input/Output for a wide variety of molecular file formats

I Fingerprints and fragment generation

I Rigid alignments, pharmacophore searching

I Substructure searching, SMARTS support

I Molecular descriptors

Cheminformaticsin R

73/189

Introduction

I/O

Fingerprints

Substructures

Some design principles

I Abstraction is the key principleI Applies to chemical objects such as atoms, bondsI Also applied to input/output

I Rather than directly creating an atom object we use afactory method

IAtom atom = DefaultChemObjectBuilder.getInstance().

newAtom();

and not

IAtom atom = new Atom();

Cheminformaticsin R

74/189

Introduction

I/O

Fingerprints

Substructures

Some design principles

I By using DefaultChemObjectBuilder we don’t worryhow an atom or bond is created

I By using this type of abstraction we can load any fileformat without having to specify it

I Another aspect is the use of interfaces rather thanconcrete types

I A molecule is a collection of atoms and bondsI A graph view is not imposed directlyI Molecular features such as aromaticity are properties of

the atoms and bonds

Cheminformaticsin R

75/189

Introduction

I/O

Fingerprints

Substructures

Molecules

I A molecule is fundamentally represented as anIAtomContainer object

I For many purposes this is fineI In some cases specialization is required so we have

I CrystalI RingI Fragment

I Some methods require a specific subclass

I Many methods simply require an IAtomContainer, so youcan use any of the above subclasses

http://pele.farmbio.uu.se/nightly/cdk-javadoc-1.3.4.git/org/openscience/cdk/IAtomContainer.html

Cheminformaticsin R

76/189

Introduction

I/O

Fingerprints

Substructures

Molecules

I Similar procedure to getting an atom or bond object

IAtomContainer container = DefaultChemObjectBuilder.

getInstance().

newAtomContainer();

I Then we populate it

container.addAtom( atom1 );

container.addAtom( atom2 );

container.addBond( atom1, atom2, 1.5);

Cheminformaticsin R

77/189

Introduction

I/O

Fingerprints

Substructures

Outline

Introduction

I/O

Fingerprints

Substructures

Cheminformaticsin R

78/189

Introduction

I/O

Fingerprints

Substructures

Input/Output

I SMILES are a common format

I SMILES parsing, you will have to handle theInvalidSmilesException

SmilesParser sp = new SmilesParser(

DefaultChemObjectBuilder.

getInstance());

IMolecule molecule = sp.parseSmiles("CCCC(CC)CC=CC=CC");

I Given a molecule object, we can create a SMILES

SmilesGenerator sg = new SmilesGenerator();

String smiles = sg.createSMILES(molecule);

System.out.println("SMILES = "+smiles);

Cheminformaticsin R

79/189

Introduction

I/O

Fingerprints

Substructures

Input/Output

I Reading files from disk is a common task

I A little more complex due to abstraction but is quitegeneral

File file = new File("mols.sdf");

FileReader fileReader = new FileReader(file);

MDLV2000Reader reader = new MDLV2000Reader(fileReader);

IChemFile chemFile = (IChemFile) reader.

read(new ChemFile());

List containers = ChemFileManipulator.

getAllAtomContainers(chemFile);

Cheminformaticsin R

80/189

Introduction

I/O

Fingerprints

Substructures

Input/Output

I This is nice since it allows you to get all the molecules inthe file at one go

I Not a good idea for very large SD files

I In such cases, use IteratingMDLReader

I Allows you to iterate over arbitrarily large SD files

I There is also an IteratingSMILESReader for big SMILESfiles

http://pele.farmbio.uu.se/nightly/cdk-javadoc-1.3.4.git/org/openscience/cdk/io/iterator/IteratingMDLReader.html

http://pele.farmbio.uu.se/nightly/cdk-javadoc-1.3.4.git/org/openscience/cdk/io/iterator/IteratingSMILESReader.html

Cheminformaticsin R

81/189

Introduction

I/O

Fingerprints

Substructures

Input/Output

I Writing files is also supported for a wide variety offormats

I For example, to write to SD format

FileWriter w1 = new FileWriter(new File("molecule.sdf"));

try {

MDLWriter mw = new MDLWriter(w1);

mw.write(molecule);

mw.close();

} catch (Exception e) {

System.out.println(e.toString());

}

Cheminformaticsin R

82/189

Introduction

I/O

Fingerprints

Substructures

SD tags

I SD tags are a common way to associate propertyinformation with a molecular structure

I See PubChem SD files to get an idea of what we can putin them

CDK 1/28/07,15:24

2 1 0 0 0 0 0 0 0 0999 V2000

0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

1 2 1 0 0 0 0

M END

> <myProperty>

1.2

> <anotherProperty>

Hello world

Cheminformaticsin R

83/189

Introduction

I/O

Fingerprints

Substructures

SD tags

I When a molecule is read in from an SD file we can getthe value of a given tag by doing

Object value = molecule.getProperty("myProperty");

I When writing a molecule we can supply a set of tags in aHashMap

HashMap tags = new HashMap();

tags.put("myProperty", new Double(1.2));

Cheminformaticsin R

84/189

Introduction

I/O

Fingerprints

Substructures

SD tags

I To ensure tags are written to disk we would do

FileWriter w1 = new FileWriter(new File("molecule.sdf"));

try {


mw.setSdFields( tags );

mw.write(molecule);

mw.close();

} catch (Exception e) {

System.out.println(e.toString());

}

Cheminformaticsin R

85/189

Introduction

I/O

Fingerprints

Substructures

SD tags

I A molecule might have several properties associated withit.

I We can avoid creating an extra HashMap, when writingthem all to disk


mw.setSdFields( molecule.getProperties() );

mw.write(molecule);

mw.close();

Cheminformaticsin R

86/189

Introduction

I/O

Fingerprints

Substructures

Generating canonical SMILES

I The SmilesGenerator class will create canonical SMILES

String smiles = "C(C)(C)CC=CC(C(CC(C))CC)CC";

SmilesParser sp = new SmilesParser();

IMolecule mol = sp.parseSmiles(smiles);

SmilesGenerator sg = new SmilesGenerator();

String canSmi = sg.createSMILES(mol);

http://pele.farmbio.uu.se/nightly/cdk-javadoc-1.3.4.git/org/openscience/cdk/smiles/SmilesGenerator.html

Cheminformaticsin R

87/189

Introduction

I/O

Fingerprints

Substructures

Fingerprints

1 0 1 1 0 0 0 1 0

I Used in many areas - database screening, diversityanalysis, modeling

Cheminformaticsin R

88/189

Introduction

I/O

Fingerprints

Substructures

Outline

Introduction

I/O

Fingerprints

Substructures

Cheminformaticsin R

89/189

Introduction

I/O

Fingerprints

Substructures

Fingerprints

I The simplest fingerprint looks for specific substructuresand sets a specific bit

I MACCSFingerprinterI SubstructureFingerprinter

I Hashed fingerprints don’t maintain a correspondencebetween bit position and substructural feature

Cheminformaticsin R

90/189

Introduction

I/O

Fingerprints

Substructures

Getting fingerprints

I The CDK can generate binary fingerprints

I This gives a default fingerprint of 1024 bits - but you canspecify smaller or larger fingerprints

I The CDK also provides variants of the default fingerprintI MACCSFingerprintI ExtendedFingerPrinter - includes bits related to ringsI GraphOnlyFingerprinter - simplified version of

fingerprinter that ignores bond orderI SubstructureFingerprinter - generates fingerprints based

on a specified set of substructures

IFingerprinter fprinter = new Fingerprinter()

BitSet fingerprint = fprinter.getFingerprint(molecule);

http://pele.farmbio.uu.se/nightly/cdk-javadoc-1.3.4.git/org/openscience/cdk/fingerprint/IFingerprinter.html

http://pele.farmbio.uu.se/nightly/cdk-javadoc-1.3.4.git/org/openscience/cdk/fingerprint/MACCSFingerprinter.html

http://pele.farmbio.uu.se/nightly/cdk-javadoc-1.3.4.git/org/openscience/cdk/fingerprint/ExtendedFingerprinter.html

http://pele.farmbio.uu.se/nightly/cdk-javadoc-1.3.4.git/org/openscience/cdk/fingerprint/GraphOnlyFingerprinter.html

http://pele.farmbio.uu.se/nightly/cdk-javadoc-1.3.4.git/org/openscience/cdk/fingerprint/SubstructureFingerprinter.html

Cheminformaticsin R

91/189

Introduction

I/O

Fingerprints

Substructures

Evaluating similarity

I Given two molecules we are interested in determininghow similar they are

I A common approach is to evaluate their fingerprints

I Calculate a similarity metric (e.g., Tanimoto coefficient)

BitSet fp1 = Fingerprinter.getFingerprint(molecule1);

BitSet fp2 = Fingerprinter.getFingerprint(molecule2);

float tc = Tanimoto.calculate(fp1, fp2);

Cheminformaticsin R

92/189

Introduction

I/O

Fingerprints

Substructures

Outline

Introduction

I/O

Fingerprints

Substructures

Cheminformaticsin R

93/189

Introduction

I/O

Fingerprints

Substructures

Substructure searching

I Identify the presence/absence of a substructure in amolecule

IAtomContainer mol = ...;

SMARTSQueryTool sqt = new SMARTSQueryTool("C");

sqt.setSmarts("C(=O)C");

boolean matched = sqt.matches(mol);

I In many cases we’re interested in all the possible matches

List<List<Integer>> maps;

maps = sqt.getUniqueMatchingAtoms();

I maps.size() gives you the number of matches

I Each element of maps is a sequence of atom ID’s

Cheminformaticsin R

94/189

Introduction

I/O

Fingerprints

Substructures

Pharmacophores

I We’ll look at constructing a pharmacophore query byhand

I Generally, you’d read it from a file

I Given a query we can search for that in a target molecule

I The target molecule should have 3D coordinates

I Generally you’ll examine multiple conformers for a givenmolecule

Cheminformaticsin R

95/189

Introduction

I/O

Fingerprints

Substructures

Pharmacophores

I Construct pharmacophore groups

PharmacophoreQueryAtom o =

new PharmacophoreQueryAtom("D", "[!H0;#7,#8,#9]");

PharmacophoreQueryAtom n1 =

new PharmacophoreQueryAtom("A", "c1ccccc1");

PharmacophoreQueryAtom n2 =

new PharmacophoreQueryAtom("A", "c1ccccc1");

Cheminformaticsin R

96/189

Introduction

I/O

Fingerprints

Substructures

Pharmacophores

I Construct pharmacophore constraints

PharmacophoreQueryBond b1 =

new PharmacophoreQueryBond(o, n1, 4.0, 4.5);


new PharmacophoreQueryBond(o, n2, 4.0, 5.0);


new PharmacophoreQueryBond(n1, n2, 5.4, 5.8);

Cheminformaticsin R

97/189

Introduction

I/O

Fingerprints

Substructures

Pharmacophores

I Given pharmacophore groups and constraints, weconstruct a pharmacophore query

I Design to have the same semantics as a normal molecule

QueryAtomContainer query =

new QueryAtomContainer();

query.addAtom(o);

query.addAtom(n1);

query.addAtom(n2);

query.addBond(b1);

query.addBond(b2);

query.addBond(b3);

Cheminformaticsin R

98/189

Introduction

I/O

Fingerprints

Substructures

Pharmacophores

I Given the query we can now perform a pharmacophoresearch

IAtomContainer aMolecule;

// load in the molecule

PharmacophoreMatcher matcher =

new PharmacophoreMatcher(query);

boolean status = matcher.matches(aMolecule);

if (status) {

// get the matching pharmacophore groups

List<List<PharmacophoreAtom>> pmatches =

matcher.getUniqueMatchingPharmacophoreAtoms();

}

Cheminformaticsin R

99/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Part III

Cheminformatics & R - rcdk

Cheminformaticsin R

100/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Using the CDK in R

I Based on the rJava package

I Two R packages to install (not counting thedependencies)

I Provides access to a variety of CDK classes and methods

I Idiomatic R

R Programming Environment

rJava

CDK Jmol

rcdk

XML

rpubchem

fingerprint

Cheminformaticsin R

101/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Outline

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Cheminformaticsin R

102/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Reading in data

I The CDK supports a variety of file formats

I rcdk loads all recognized formats, automatically

I Data can be local or remote

mols <- load.molecules( c("data/io/set1.sdf",

"data/io/set2.smi",

"http://rguha.net/rcdk/remote.sdf"))

I Gives you a list of Java references representingIAtomContainer objects

I Can’t do much with these objects, except via rcdkfunctions

Cheminformaticsin R

103/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Reading structures from text files

I SMILES strings can be contained in arbitrary text files(such as CSV)

I If you read such a file with read.table or read.csv, makesure to set comment.char=""

I Also a good idea to specify the colClasses argument orelse as.is=TRUE- otherwise SMILES will be read in asfactors

Cheminformaticsin R

104/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Reading SMILES strings

I Also useful to load molecules directly from a SMILESstring

I parse.smiles parses a single molecule

mol <- parse.smiles("c1ccccc1C(=O)N")

## loop over multiple SMILES

smiles <- c("CCC", "c1ccccc1",

"C(C)(C=O)C(CCNC)C1CC1C(=O)")

mols <- sapply(smiles, parse.smiles)

I But the second call is inefficient, especially for largeSMILES collections

Cheminformaticsin R

105/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Efficient SMILES parsing

I By default parse.smiles instantiates a SMILES parserobject (SmilesParser)

I So when looping over a vector of SMILES strings we’recreating a parser object each time

I In such a case, specify a parser object explicitly - 2Xspeedup

smilesParser <- get.smiles.parser()

## now specify this parser

smiles <- c("CCC","c1ccccc1","C(C)(C=O)C(CCNC)C1CC1C(=O)")

mols <- sapply(smiles, parse.smiles, parser = smilesParser)

Cheminformaticsin R

106/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Efficient SMILES parsing

> parser <- get.smiles.parser()

> smiles <- rep(c("CCC",

"c1ccccc1",

"C(C)(C=O)C(CCNC)C1CC1C(=O)"),

500)

> system.time(junk <- sapply(smiles, parse.smiles))

user system elapsed

4.088 0.106 3.946

> system.time(junk <- sapply(smiles, parse.smiles, parser=parser))

user system elapsed

1.792 0.032 1.705

Cheminformaticsin R

107/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

You get what is provided . . . and no more

I CDK philosophy is not to do extra work

I For file reading this implies that it only parses what isspecified in the file and will not perform operations thatare not explicitly indicated

I As a result some features (isotopic masses) of moleculesand atoms are not automatically configured

I load.molecule will do these extra steps

I parse.smiles will not

Cheminformaticsin R

108/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Configuring molecules

## no explicit configuration needed

mols <- load.molecules("data/io/bp.smi")

get.exact.mass(mols[[1]])

## explicit configuration required

mol <- parse.smiles("c1ccccc1")

get.exact.mass(mol)

I The last call will give a NullPointerException sincethe isotopic masses have not been configured

Cheminformaticsin R

109/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Configuring molecules

I When parsing SMILES, do the configuration by hand

smi <- "OC(=O)C1=C(C=CC=C1)OC(=O)C"

mol <- parse.smiles(smi)

## do the configuration

do.typing(mol) ## not required in recent versions

do.aromaticity(mol) ## not required in recent versions

do.isotopes(mol)

## now, this works

get.exact.mass(mol)

Cheminformaticsin R

110/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Hydrogens - implicit & explicit

I SMILES indicates that when hydrogens are not specified,they are to be considered implicit

I But MDL MOL files have no such specification, so if noH’s specified, the molecule will not have explicit orimplicit hydrogens

mol <- load.molecules("data/noh.mol")[[1]]

unlist(lapply(get.atoms(mol), get.symbol))

I In such cases, convert.implicit.to.explicit willadd implicit H’s (to satisfy valencies) and then convertthem to explicit

convert.implicit.to.explicit(mol)

unlist(lapply(get.atoms(mol), get.symbol))

Cheminformaticsin R

111/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

A note on function calls

I R uses copy-on-write semantics for function callsI If a function modifies its input argument, R will send a

copy of the input variable to the functionI Otherwise send a referenceI “Safety” of call-by-value, efficiency of call-by-reference

I In contrast, convert.implicit.to.explicit modifiesthe input argument

I No return value

I In general rJava allows you to work with Java objectsusing call-by-reference semantics

Cheminformaticsin R

112/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Writing molecules

I Currently only SDF is supported as an output file format

I By default a multi-molecule SDF will be written

I Properties are not written out as SD tags by default

smis <- c("c1ccccc1", "CC(C=O)NCC", "CCCC")

mols <- sapply(smis, parse.smiles)

## all molecules in a single file

write.molecules(mols, filename="mols.sdf")

## ensure molecule data is written out

write.molecules(mols, filename="mols.sdf", write.props=TRUE)

## molecules in individual files

write.molecules(mols, filename="mols.sdf", together=FALSE)

Cheminformaticsin R

113/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Generating SMILES

I Finally, we can also create SMILES strings frommolecules via get.smiles

I Works on a single molecule at a time

I Canonical SMILES are generated by default

smis <- c("c1ccccc1", "CC(C=O)NCC", "CCCC")

mols <- sapply(smis, parse.smiles)

lapply(mols, get.smiles)

Cheminformaticsin R

114/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Outline

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Cheminformaticsin R

115/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Accessing attached data

I Some file formats allow you to attach data to a structure

I SDF is a good example, using named tags and associatedvalues

I For example, a DrugBank SDF has fields such as

> <DRUGBANK_ID>

DB00135

> <DRUGBANK_GENERIC_NAME>

L-Tyrosine

> <DRUGBANK_MOLECULAR_FORMULA>

C9H11NO3

> <DRUGBANK_MOLECULAR_WEIGHT>

181.1885

http://www.drugbank.ca/

Cheminformaticsin R

116/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Accessing data fields

I The CDK reads SDF tags and their values into theIAtomContainer object

I We can access them via the get.property andget.properties

> mols <- load.molecules("data/io/set1.sdf")

> get.properties(mols[[1]])

$DRUGBANK_ID

[1] "DB00135"

$DRUGBANK_IUPAC_NAME

[1] "(2S)-2-amino-3-(4-hydroxyphenyl)propanoic acid"

$‘cdk:Title‘

[1] "135"

...

Cheminformaticsin R

117/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Accessing data fields

I If you know the name of the field, just pull the valuedirectly

get.property(mols[[1]], "DRUGBANK_GENERIC_NAME")

I Note that values for all tags are character, so you needto cast them to the appropriate types manually

Cheminformaticsin R

118/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Extra tags

I In some cases, you’ll see tag names that are not in theSDF file

I The CDK uses molecule property fields to attacharbitrary data.

I For example, the title field from a SDF entry is storedusing the cdk:Title name

I In general, all CDK-derived properties have namesprefixed by “cdk:”

## identify CDK-derived properties

props <- get.properties(mols[[1]])

props[ grep("^cdk:", names(props)) ]

Cheminformaticsin R

119/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Setting property values

I You can add your own property values

I The type of the value is preserved, as long as you arewithin R

I The type of the name (or key) must be character

mol <- parse.smiles("c1ccccc1Cc2ccccc2")

set.property(mol, "MyProperty", 23.14)

set.property(mol, "toxic", "yes")

get.properties(mol)

Cheminformaticsin R

120/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Persisting molecular data

I You could extract the properties and write them to a textfile

mols <- load.molecules("data/io/set1.sdf")

props <- lapply(mols, get.properties)

## convert list to table and write it out

props <- do.call("rbind", props)

write.table(props, "drugdata.txt", row.names=FALSE)

Cheminformaticsin R

121/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Persisting molecular data

I Molecular data can also be persisted by writing to SDFformat

mol <- parse.smiles("c1ccccc1")

set.property(mol, "foo", 1)

set.property(mol, "bar", "baz")

write.molecules(mol, "mol.sdf", write.props=TRUE)

Cheminformaticsin R

122/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Working with molecules

I Currently you can access atoms, bonds, get certain atomproperties, 2D/3D coordinates

I Since rcdk doesn’t cover the entire CDK API, you mightneed to drop down to the rJava level and make calls tothe Java code by hand

I I’m happy to code specific tasks in idiomatic R

Cheminformaticsin R

123/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Working with molecules

I A useful task is to wash or standardize molecules

I CDK doesn’t provide a method to do this directly

I But a common operation is to check for disconnectedmolecules and keep the largest fragment

> m <- parse.smiles("c1ccccc1")

> is.connected(m)

[1] TRUE

>

> m <- parse.smiles("C1CC(N=C1)C(=O)[O-].[Na+]")

> is.connected(m)

[1] FALSE

>

> largest <- get.largest.component(m)

> length(get.atoms(largest))

[1] 8

Cheminformaticsin R

124/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Getting atoms & bonds

I Simple to get atoms and bonds

mol <- parse.smiles("c1ccccc1C(Cl)(Br)c1ccccc1")

atoms <- get.atoms(mol)

bonds <- get.bonds(mol)

I These are lists of jobjRef objects

I rcdk currently supports getting the symbol, coordinates,atomic number, implicit H-count

I Can also check whether an atom is aromatic etc.

I ?get.symbol

Cheminformaticsin R

125/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Working with atoms

I Simple elemental analysis

I Identifying flat molecules

mol <- parse.smiles("c1ccccc1C(Cl)(Br)c1ccccc1")

atoms <- get.atoms(mol)

## elemental analysis

syms <- unlist(lapply(atoms, get.symbol))

round( table(syms)/sum(table(syms)) * 100, 2)

## is the molecule flat?

coords <- do.call("rbind", lapply(atoms, get.point3d))

any(apply(coords, 2, function(x) length(unique(x)) == 1))

Cheminformaticsin R

126/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

SMARTS matching

I rcdk supports substructure searches with SMARTS orSMILES

I May not be practical for large collections of moleculesdue to memory

mols <- sapply(c("CC(C)(C)C",

"c1ccc(Cl)cc1C(=O)O",

"CCC(N)(N)CC"), parse.smiles)

query <- "[#6D2]"

hits <- matches(query, mols)

> print(hits)

CC(C)(C)C c1ccc(Cl)cc1C(=O)O CCC(N)(N)CC

FALSE TRUE TRUE

Cheminformaticsin R

127/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Persisting objects

I We already saw how we can persist molecule properties

I It’d be useful to also persist CDK objects (i.e.,jobjRef’s)

m <- parse.smiles("c1ccccc1")

obj <- .jserialize(m)

newObj <- .junserialize(obj)

I Serialized objects are just raw vectors, so can be savedvia save

I But also take a look at .jcache

http://stat.ethz.ch/R-manual/R-patched/library/base/html/save.html

http://www.rforge.net/doc/packages/rJava/jserialize.html

Cheminformaticsin R

128/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Outline

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Cheminformaticsin R

129/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Visualization

I rcdk supports visualization of 2D structure images intwo ways

I First, you can bring up a Swing window

I Second, you can obtain the depiction as a raster image

I Doesn’t work on OS X

mols <- load.molecules("data/dhfr_3d.sd")

## view a single molecule in a Swing window

view.molecule.2d(mols[[1]])

## view a table of molecules

view.molecule.2d(mols[1:10])

Cheminformaticsin R

130/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Visualization

Cheminformaticsin R

131/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Visualization

I The Swing window is a little heavy weight

I It’d be handy to be able to annotate plots with structures

I Or even just make a panel of images that could be savedto a PNG file

I We can make use of rasterImage and rcdk

I As with the Swing window, this won’t work on OS X

http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/raster.html

Cheminformaticsin R

132/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Visualization

m <- parse.smiles("c1ccccc1C(=O)NC")

img <- view.image.2d(m, 200,200)

## start a plot

plot(1:10, 1:10, pch=19)

## overlay the structure

rasterImage(img, 1,8, 3,10)

Cheminformaticsin R

133/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Visualization

I By playing around with plotting parameters we can fill upthe plot region with an image

I The values being plotted define the coordinates used tooverlay the raster image


## A table of 16 structures

par(mfrow=c(4,4), mar=c(0,0,0,0))

for (i in 1:16) {

img <- view.image.2d(mols[[i]], 200, 200)

plot(1:10, xaxt="n", yaxt="n", xaxs="i", yaxs="i", col="white")

rasterImage(img, 1,1, 10,10)

}

Cheminformaticsin R

134/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Outline

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Cheminformaticsin R

135/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Molecular descriptors

I Numerical representations of chemical structure featuresI Can be based on

I connectivityI 3D coordnatesI electronic propertiesI combination of the above

I Many descriptors are described and implemented invarious forms

I The CDK implements 45 descriptor classes, resulting in≈ 300 individual descriptor values for a given molecule

Cheminformaticsin R

136/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Descriptor caveats

I Not all descriptors are optimized for speed

I Some of the topological descriptors employ graphisomorphism which makes them slow on large molecules

I In general, to ensure that we end up with a rectangulardescriptor matrix we do not catch exceptions

I Instead descriptor calculation failures return NA

Cheminformaticsin R

137/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

CDK Descriptor Classes

I The CDK provides 3 packages for descriptor calculationsI

org.openscience.cdk.qsar.descriptors.molecularI org.openscience.cdk.qsar.descriptors.atomicI org.openscience.cdk.qsar.descriptors.bond

I rcdk only supports molecular descriptorsI Each descriptor is also described by an ontology

I For rcdk this is used to classify descriptors into groups

Cheminformaticsin R

138/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Descriptor calculations

I Can evaluate a single descriptor or all availabledescriptors

I If a descriptor cannot be calculated, NA is returned (so noexceptions thrown)

I First need to get available descriptor class names

dnames <- get.desc.names()

I Gives you a vector of all descriptor class names

Cheminformaticsin R

139/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Descriptor calculations

I You can also specify descriptor categories, to get a subset

dnames <- get.desc.names("topological")

## what categories are available?

> get.desc.categories()

[1] "electronic" "protein" "topological" "geometrical" "constitutional" "hybrid"

I Depending on the input structures, some descriptors maynot make sense

I If you evaluate 3D descriptors from SMILES input, they’dall be set to NA

Cheminformaticsin R

140/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Evaluating the descriptors

I With a set of descriptor class names in hand, you canevaluate them

descs <- eval.desc(mols, dnames)

I descs will be a data.frame which can then be used in anyof the modeling functions

I Column names are the descriptor names provided by theCDK

I Good practise to put molecule titles in a column or in therow names

Cheminformaticsin R

141/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

The QSAR workflow

Cheminformaticsin R

142/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

The QSAR workflow

I Before model development you’ll need to clean themolecules, evaluate descriptors, generate subsets

I With the numeric data in hand, we can proceed tomodeling

I Before building predictive models, we’d probably explorethe dataset

I Normality of the dependent variableI Correlations between descriptors and dependent variableI Similarity of subsets

I Then we can go wild and build all the models that Rsupports

Cheminformaticsin R

143/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Outline

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Cheminformaticsin R

144/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Accessing fingerprints

I CDK provides several fingerprintsI Path-based, MACCS, E-State, PubChem

I Access them via get.fingerprint(...)

I Works on one molecule at a time, use lapply to process alist of molecules

I This method works with the fingerprint packageI Separate package to represent and manipulate fingerprint

data from various sources (CDK, BCI, MOE)I Uses C to perform similarity calculationsI Lots of similarity and dissimilarity metrics available

http://stat.ethz.ch/R-manual/R-patched/library/base/html/lapply.html

Cheminformaticsin R

145/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Accessing fingerprints


## get a single fingerprint

fp <- get.fingerprint(mols[[1]], type="maccs")

## process a list of molecules

fplist <- lapply(mols, get.fingerprint, type="maccs")

I Each fingerprint is an S4 object

I See the fingerprint package man pages for moredetails

Cheminformaticsin R

146/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Reading fingerprints

I We can also read in fingerprint data generated by othertools - BCI, MOE, CDK

fplist <- fp.read("data/fp/fp.data", size=1052,

lf=bci.lf, header=TRUE)

I Easy to support new line-oriented fingerprint formats byproviding your own line parsing function

I See bci.lf for an example

http://rss.acs.unt.edu/Rdoc/library/fingerprint/html/linefunc.html

Cheminformaticsin R

147/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Similarity metrics

I The fingerprint package implements 28 similarity anddissimilarity metrics

I All accessed via the distance function (so you call thiseven if you want similarity)

I Implemented in C, but still, large similarity matrixcalculations are not a good idea!

## similarity between 2 individual fingerprints

distance(fplist[[1]], fplist[[2]], method="tanimoto")

distance(fplist[[1]], fplist[[2]], method="mt")

## similarity matrix - compare similarity distributions

m1 <- fp.sim.matrix(fplist, "tanimoto")

m2 <- fp.sim.matrix(fplist, "carbo")

par(mfrow=c(1,2))

hist(m1, xlim=c(0,1))

hist(m2, xlim=c(0,1))

Cheminformaticsin R

148/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Comparing datasets with fingerprints

I We can compare datasets based on a fingerprints

I Rather than perform pairwise comparisons, we evaluatethe normalized occurence of each bit, across the dataset

I Gives us a n-D vector - the “bit spectrum”

bitspec <- bit.spectrum(fplist)

plot(bitspec, type="l")

0Guha, R., J. Comp. Aid. Molec. Des., 2008, 22, 367–384

Cheminformaticsin R

149/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Bit spectrum

0 50 100 150

0.0

0.2

0.4

0.6

0.8

1.0

Bit Position

Fre

quen

cy

I Only makes sense with structural key type fingerprints

I Longer fingerprints give better resolution

I Comparing bit spectra, via any distance metric, allows usto compare datasets in O(n) time, rather than O(n2) fora pairwise approach

0Guha, R., J. Comp. Aid. Molec. Des., 2008, 22, 367–384

Cheminformaticsin R

150/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Comparing datasets

Bit Position

Nor

mal

ized

Fre

quen

cy

0.0

0.2

0.4

0.6

0.8

1.0

0 50 100 150

Entire Dataset

0.0

0.2

0.4

0.6

0.8

1.0

Subset 1

0.0

0.2

0.4

0.6

0.8

1.0

Subset 2

Cheminformaticsin R

151/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Fingerprints & clustering

I Clustering large collections is not practical

I Any way to avoid a pairwise similarity matrix is desirable

## Get fingerprints


fplist <- lapply(mols, get.fingerprint, type="maccs")

## Gives us a similarity matrix

fpdist <- fp.sim.matrix(fplist)

## Convert to a distance matrix

fpdist <- as.dist(1-fpdist)

## Perform the clustering

clus <- hclust(fpdist, method="ward")

Cheminformaticsin R

152/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Fingerprints & clustering

05

1015

2025

3035

Hierarchical Clustering of the Sutherland DHFR Dataset

Cheminformaticsin R

153/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Comparing fingerprint performance

I Various studies comparing virtual screening methods

I Generally, metric of success is how many actives areretrieved in the top n% of the database

I Can be measured using ROC, enrichment factor, etc.I Exercise - evaluate performance of CDK fingerprints

using enrichment factorsI Load active and decoy moleculesI Evaluate fingerprintsI For each active, evaluate similarity to all other molecules

(active and inactive)I For each active, determine enrichment at a given

percentage of the database screened

For a given query molecule, order dataset by decreasingsimilarity, look at the top 10% and determine fraction of

actives in that top 10%

Cheminformaticsin R

154/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments


ActivesInactives

Query Targets

Similarity

Cheminformaticsin R

155/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments


I A good dataset to test this out is the MaximumUnbiased Validation datasets by Rohr & Baumann

I Derived from 17 PubChem bioassay datasets anddesigned to avoid analog bias and artifical enrichment

I As a result, 2D fingerprints generally show poorperformance on these datasets (by design)

I See here for a comparison of various fingerprints usingtwo of these datasets

0Rohrer, S.G et al, J. Chem. Inf. Model, 2009, 49, 169–1840Good, A. & Oprea, T., J. Chem. Inf. Model, 2008, 22, 169–1780Verdonk, M.L. et al, J. Chem. Inf. Model, 2004, 44, 793–806

http://www.pharmchem.tu-bs.de/lehre/baumann/MUV.html

http://www.pharmchem.tu-bs.de/lehre/baumann/MUV.html

http://dx.doi.org/10.1021/ci8002649

http://dx.doi.org/10.1007/s10822-007-9167-2

http://dx.doi.org/10.1021/ci034289q

http://rguha.wordpress.com/2008/10/11/do-the-cdk-fingerprints-work/

Cheminformaticsin R

156/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Comparing fingerprint performance0.

00.

20.

40.

60.

81.

0

Query Molecule

Sim

ilarit

y

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

Cheminformaticsin R

157/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Outline

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Cheminformaticsin R

158/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Fragment based analysis

I Fragment based analysis can be a useful alternative toclustering, especially for large datasets

I Useful for identifying interesting seriesI Many fragmentation schemes are available

I ExhaustiveI Rings and ring assembliesI Murcko

I The CDK supports fragmentation (still needs work) intoMurcko frameworks and ring systems

I We can also use the NCGC fragmenter (seedata/fragmenter.jar) to exhaustively generatefragments

Cheminformaticsin R

159/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Getting fragments

I rcdk currently wraps the GenerateFragments class

I Only exposes the Murcko fragmentation method

I Lets look at directly working a CDK class and itsmethods

I Once we have the fragmenter we can perform a Murckofragmentation and get back the fragments as SMILES

Cheminformaticsin R

160/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Getting fragments

mol <- parse.smiles(

"c1cc(c(cc1c2c(nc(nc2CC)N)N)[N+](=O)[O-])NCc3ccc(cc3)C(=O)N4CCCCC4")

## get the fragmenter

fragmenter <- .jnew("org/openscience/cdk/tools/GenerateFragments")

## do fragmentation

.jcall(fragmenter, "V", "generateMurckoFragments",

.jcast(mol, "org/openscience/cdk/interfaces/IMolecule"),

TRUE, TRUE, as.integer(6))

## get fragments as SMILES

frags <- .jcall(fragmenter,

"[S",

"getMurckoFrameworksAsSmileArray")

Cheminformaticsin R

161/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Some caveats

I The fragment SMILES come out as non-aromatic

I There is supposed to be a way to get the appropriateSMILES out, currently looking into this

I Implies that if we want to look at fragment properties,we’re stuck

I However, we can still use these to look at series and/orlocal trends etc.

Cheminformaticsin R

162/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Aggregating fragments

I Lets wrap the fragmentation procedure into a function

get.frags <- function(mol, fragmenter) {

.jcall(fragmenter, "V", "generateMurckoFragments",

.jcast(mol, "org/openscience/cdk/interfaces/IMolecule"),

TRUE, TRUE, as.integer(6))

return(.jcall(fragmenter, "[S", "getMurckoFrameworksAsSmileArray"))

}

Cheminformaticsin R

163/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments


I Now, we can get fragments for a set of molecules

mols <- load.molecules("data/dfhr_3d.sd")

fragmenter <- .jnew("org/openscience/cdk/tools/GenerateFragments")

frags <- lapply(mols, function(x, f) {

get.frags(x, f)

}, f=fragmenter)

Cheminformaticsin R

164/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments


I This gives us a list of fragment SMILES

> frags[1:3]

[[1]]

[1] "C1=CC=C(C=C1)C=2C=NC=NC=2"

[[2]]

[1] "C1=CC=C(C=C1)C=2C=NC=NC=2"

[[3]]

[1] "C1=CC=C(C=C1)C=2C=NC=NC=2"

I What we want is a data.frame that associates fragments,fragment ID’s and the SMILES from which they arederived

Cheminformaticsin R

165/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments


frags <- lapply(mols, function(x, f) {

tmp <- get.frags(x, f)

tmp <- data.frame(frag=I(tmp),

mol=I(get.property(x, "cdk:Title")))

return(tmp)

}, f=fragmenter)

frags <- do.call("rbind", frags)

I Now we have something useful to work with

> head(frags)

frag mol

1 C1=CC=C(C=C1)C=2C=NC=NC=2 1-pyrimethamine

2 C1=CC=C(C=C1)C=2C=NC=NC=2 1-3062

3 C1=CC=C(C=C1)C=2C=NC=NC=2 1-7364

4 C1=CC=C(C=C1)C=2C=NC=NC=2 1-115194

5 C1=CC=C(C=C1)OCC=2C=CN=CN=2 1-118203

6 C1=CC=C(C=C1)C=2C=NC=NC=2 1-118203

Cheminformaticsin R

166/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments


I Finally we generate some arbitrary fragment ID’s

ufrags <- data.frame(frag=I(unique(frags$frag)),

fid=I(sprintf("F%03d")),

1:length(unique(frags$frag)))))

frags <- merge(frags, ufrags, by="frag")

I . . . giving us

> tail(frags)

frag mol fid

963 O=C2NC=NC=3NC=C(CNC1=CC=CC=C1)C2=3 43-13 F239

964 O=C2NC4=NC=NC=C4(C=3CN(CC1=CC=CC=C1)CC2=3) 49-4 F265

965 O=S(=O)(C1=CC=CC=C1)C=2C=CC3=NC=NC=C3(N=2) 22-8 F150



968 O=S(=O)(NCCOC1=CC=CC=C1)C2=CC=CC=C2 1-122059 F068

Cheminformaticsin R

167/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Doing stuff with fragments

I Look at frequency of occurence of fragments

I Develop predictive models on fragment members, lookingfor local SAR

I Pseudo-cluster a dataset based on fragments

I Compound selection based on fragment membership

frag.counts <- by(frags, frags$frag, nrow)

barplot(frag.counts)

Cheminformaticsin R

168/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Characterizing fragment SAR

I Consider some of the larger series and see if there is anSAR within each of them

I StrategyI Identify fragments with 20 to 30 membersI Merge the fragment data with descriptor data for the

associated moleculesI For each fragment series, build an OLS model using

exhaustive feature selection

I I’ve precalculated a set of descriptors for this purpose (din data/frags.Rda) or you can generate fragmentsyourself

Cheminformaticsin R

169/189

I/O

Molecular data

Visualization

Descriptors

Fingerprints

Fragments

Characterizing fragment SAR

Observed Activity

Pre

dict

ed A

ctiv

ity

−0.5

0.0

0.5

1.0

1.5

2.0

−1 0 1 2

●

●

●●

●

●

●

●

●

●

●

●

●

●

●●

●

●●

●

●

●

●

●●

F018

●

●

●●

●

●

●

●

●

●

●

●

●

●

●●

●

●●

●

●

●

●

●●

F019

−1 0 1 2

● ●

●

●

●

●

●

●

●●

● ●

●

●

●

●

●

●

●

●

●

F052

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●●

●

F128

−1 0 1 2

−0.5

0.0

0.5

1.0

1.5

2.0

●

●

●

●●

●

●●

●

●

●

●

●

●

●

● ●

●

●

●●

●

●

F252

Predicted versus observed activity for 5 fragment series, basedon OLS models and exhaustive feature selection

Cheminformaticsin R

170/189

Part IV

Cheminformatics & R - rpubchem

Cheminformaticsin R

171/189

Public chemical and biological data

Cheminformaticsin R

172/189

Working with PubChem data

Bioassay data (numeric)

Assay metadata(text)

Chemical structures

Process

Extract

Process

StandardizeCleanDescriptors

ProcessExtractMerge

Model

Cheminformaticsin R

173/189

Working with PubChem data in R

I Access PubChem datasets using idiomatic R

I Minimize extra work (merge data, add metadata etc)

I End up with a data.frame, ready for modelingI The usual sequence of tasks would involve

I Get the assay dataI Get structure data associated with the assayI Proceed with modeling

Cheminformaticsin R

173/189

Working with PubChem data in R

I Access PubChem datasets using idiomatic R

I Minimize extra work (merge data, add metadata etc)

I End up with a data.frame, ready for modelingI The usual sequence of tasks would involve

I Get the assay dataI Get structure data associated with the assayI Proceed with modeling

Cheminformaticsin R

174/189

Getting assay data

I The workhorse method is get.assay which takes theAID as argument

assay <- get.assay(2018, quiet=FALSE)

I Returns a data.frame derived from the assay data CSVfile and the assay description XML file form PubChem

I The description, comments and field descriptions andtypes are available as attributes of the data.frame

I Note - no structures are included at this point

Cheminformaticsin R

175/189

Other ways to get assay data

I If you’re looking for a group of assays, say related toHDAC inhibitors, perform a keyword search

## gives a vector of integer AIDs

aids <- find.assay.id("HDAC", quiet=FALSE)

## what are the assays?

sapply(aids[1:10], get.assay.desc)

Cheminformaticsin R

176/189

Other ways to get assay data

I Find assays that a compound is active in

cid <- 68368

aids <- get.aid.by.cid(cid, type="active")

I Gives a vector AIDs that the compound is active in

I If you specify type="raw", a summary is returned

Cheminformaticsin R

177/189

Getting structure informationI Get structure data by CID or SID

cids <- get.cid(assay$PUBCHEM.CID[1:10])

I Gives you a data.frame containing name, structure andsome properties

> str(structs)


$ CID : chr "644436" "645300" "645983" "648874" ...

$ IUPACName : chr "2-(2-methylphenyl)-1,3-benzoxazol-5-amine"

$ CanonicalSmile : chr "CC1=CC=CC=C1C2=NC3=C(O2)C=CC(=C3)N"

$ MolecularFormula : chr "C14H12N2O" "C18H15N3O2" "C24H34N6O3"

$ MolecularWeight : num 224 305 455 407 290 ...

$ TotalFormalCharge : int 0 0 0 0 0 0 0 0 0 0

$ XLogP : num 3.1 3.2 3 4.5 2.1 2.9 1.8 2.5 0.7 1.7

$ HydrogenBondDonorCount : int 1 0 1 1 2 2 1 1 2 2

$ HydrogenBondAcceptorCount: int 3 4 7 5 4 3 7 8 5 4

$ HeavyAtomCount : int 17 23 33 29 21 26 33 31 20 21

$ TPSA : num 52 57 94.4 68 84.5 70.7 94.4 125 104 69.6

Cheminformaticsin R

178/189

Ready to model

I Having gotten the data.frame of structures we canparse the SMILES into CDK objects and evaluatedescriptors

I However, get.cid (and get.sid) will not work if youtry and retrieve too many at one go

I The current design makes an Entrez URL, which canbecome too large and so is rejected

I Will be switching to the use of PUG

I In the meantime, retreive structure data in chunks

http://pubchem.ncbi.nlm.nih.gov/pug/pughelp.html

Cheminformaticsin R

179/189

Getting CID data in chunks

## will likely fail

structs <- get.cid(assay$PUBCHEM.CID)

## better way, 300 CIDs at a time

library(itertools)

chunks <- as.list(ichunk(assay$PUBCHEM.CID, 300))

structs <- lapply(chunks, get.cid)

structs <- do.call("rbind", structs)

Cheminformaticsin R

180/189

Part V

Extending rcdk

Cheminformaticsin R

181/189

Why extend?

I rcdk doesn’t cover the whole CDK API

I I add functions as and when I need them (or if someoneasks)

I If you want to access CDK classes and methods, you canuse rJava (such as .jnew and .jcall)

I This can get a bit cumbersomeI Use the soure code for inspiration

I For more complex stuff, you can write Java code thatgets installed along with rcdk

I Ideally this would get contributed back and incorporatedinto the rcdk.jar file that is bundled with the rcdkpackage

Cheminformaticsin R

182/189

Structure of the rcdk package

I If you add new functionality, good idea to add unit tests(using RUnit) under rcdk/inst/unitTests/

http://cran.r-project.org/web/packages/RUnit/index.html

Cheminformaticsin R

183/189

Structure of the cdkr project

I Within rcdkjar/, run ant jar to build the jar file andcopy into the rcdk package directory

I Generally, no need to touch rcdklibs/

Cheminformaticsin R

184/189

Writing Java in R

I Don’t like it, but sometimes it’s easier than firing up theJava IDE. Need to be familiar with the CDK API

I Three main functionsI .jnew

I .jcall

I .jcast

Cheminformaticsin R

185/189

Writing Java in R

I First, lets look at a better way to get the atom count ofa molecule

I Right now, we get a molecule and get the list of atomobjects and return the length of the list

I But IAtomContainer has a method getAtomCount()

m <- parse.smiles("CCC")

.jcall(mol, "I", "getAtomCount")

I For static methods, the first cargument can be the fullyqualified class name

I You do need to specify the return type in JNI notation

Cheminformaticsin R

186/189

Writing Java in R

JNI Signature Java Type

I intZ booleanB byteC charLfully-qualified-class; fully-qualified-class[type type[]

I For the L signature, remember the semi-colon at the end

Cheminformaticsin R

187/189

Writing Java in R

I Another example is getting the largest connectedcomponent

I We make use of the ConnectivityChecker class andthe partitionIntoMolecules method

mol <- parse.smiles("CCC(=O)C[O-].[Na+]")

molSet <- .jcall("org.openscience.cdk.graph.ConnectivityChecker",

"Lorg/openscience/cdk/interfaces/IMoleculeSet;",

"partitionIntoMolecules", mol)

ncomp <- .jcall(molSet, "I", "getMoleculeCount")

http://pele.farmbio.uu.se/nightly/cdk-javadoc-1.3.4.git/org/openscience/cdk/graph/ConnectivityChecker.html

Cheminformaticsin R

188/189

Writing Java in Java

I When writing Java in R gets too tedious, switch to Javain Java

I You could write classes to do the work and then packagethem into a jar

I Copy the jar into the inst/cont of the rcdk directory

I Edit .First.lib in R/rcdk.R to add this jar to theclasspath

I Alternatively edit the Java sources for the rcdk packageI Need to access the Github repository

http://github.com/rajarshi/cdkr/

Cheminformaticsin R

189/189

Future developments

I Update rpubchem to use PUG throughout

I Data table and depictions

I Streamline I/O and molecule configuration

I Add more atom and bond level operationsI Convert from jobjRef to S4 objects and vice versa

I Would allow for serialization of CDK data classesI Is it worth the effort?

I Your suggestions

Education

Cheminformatics in R