Slides for a workshop run at the EBI, UK
Cheminformatics in R
or How to Start R and Never Have to Exit
Rajarshi Guha
NIH Chemical Genomics Center
17th May, 2010
EBI, Hinxton
Background
- Involved in various aspects of cheminformatics since 2002
  - QSAR modeling, virtual screening, polypharmacology, networks
  - Algorithm development
  - Cheminformatics software development
- Core developer for the CDK
- Pretty much live inside of R
  - Been using it since 2003, developed a number of R packages, mostly public
- Make extensive use of R at NCGC for small molecule & RNAi screening
Acknowledgements
- rcdk
  - Miguel Rojas
  - Ranke Johannes
- CDK
  - Egon Willighagen
  - Christoph Steinbeck
  - ...
Resources
- Download binaries
  - If you’re working on Unix it’s a good idea to compile from source
- There’s lots of documentation of varying quality
  - The R manual is a good document to start with
  - R Data Import and Export is extremely handy regarding data loading
  - A variety of short introductory tutorials as well as documents on specific topics in statistical modeling can be found here
- More general help can be obtained from
  - The R-help mailing list. It has a lot of renowned experts, and you can learn quite a bit of statistics, in addition to getting R help, just by reading the archives of the list. Note: they do expect that you have done your homework! See the posting guide
  - A useful reference for Python/Matlab users learning R
R Books
- A number of books are available, ranging from introductions to R along with examples up to advanced statistical modeling using R
  - Data Analysis and Graphics Using R: An Example-based Approach by John Maindonald, John Braun
  - Introductory Statistics with R by Peter Dalgaard
  - Using R for Introductory Statistics by John Verzani
- If you end up programming a lot in R, you’ll want to have Programming with Data: A Guide to the S Language by John M. Chambers
IDEs and editors
- Emacs + ESS - works on Linux, Windows, OS X. Invaluable if you are an Emacs user
- TINN-R is a small and useful editor for Windows
- You can also do R in Eclipse
  - StatET
  - Rsubmit
- A number of different GUIs are available for R, which may be useful if you’re an infrequent user
  - Rcommander
  - JGR
  - Rattle
  - PMG
  - SciViews-R
  - Red-R
What is R?
- R is an environment for modeling
  - Contains many prepackaged statistical and mathematical functions
  - No need to implement anything
- R is a matrix programming language that is good for statistical computing
  - Full fledged, interpreted language
  - Well integrated with statistical functionality
  - More details later
- It is possible to use R just for modeling
  - Avoids programming, preferably use a GUI
  - Load data → build model → plot data
- But you can also get much more creative
  - Scripts to process multiple data files
  - Ensemble modeling using different types of models
  - Implement brand new algorithms
- R is good for prototyping algorithms
  - Interpreted, so immediate results
  - Good support for vectorization
    - Faster than explicit loops
    - Analogous to map in Python and Lisp
  - Most times, interpreted R is fine, but you can easily integrate C code
- R integrates with other languages
  - C code can be linked to R and C can also call R functions
  - Java code can be called from R and vice versa. See various packages at rosuda.org
  - Python can be used in R and vice versa using Rpy
- R has excellent support for publication quality graphics
  - See R Graph Gallery for an idea of the graphing capabilities
  - But graphing in R does have a learning curve
- A variety of graphs can be generated
  - 2D plots - scatter, bar, pie, box, violin, parallel coordinate
  - 3D plots - OpenGL support is available
Why Cheminformatics in R?
- In contrast to bioinformatics (cf. Bioconductor), there is not a whole lot of cheminformatics support for R
- Cheminformatics and chemistry relevant packages include
  - rcdk, rpubchem, fingerprint
  - bio3d, ChemmineR
- A lot of cheminformatics employs various forms of statistics and machine learning - R is exactly the environment for that
- We just need to add some chemistry capabilities to it
Outline
- An overview of the R language
- An overview of the CDK library
- Exploring the rcdk package
- Exploring the rcdklibs package
- Extending the code
Accessing packages
- From within R (you’ll need 2.11.0 or better)
install.packages(c("rcdk","rpubchem"), dependencies = TRUE)
- Sources for rcdk and rcdklibs can be obtained from the Github repository
  - This is required if you want to modify Java code associated with rcdk
- Installable development packages (OS X, Linux, Windows) available from http://rguha.net/rcdk
  - These will make their way to CRAN
Workshop prerequisites
- Everything should be installed
- Running the code below should not give any errors
library(rcdk)
library(rpubchem)
source("helper.R")
- Most of the code snippets in these slides are contained in code.R
- You should have a data/ directory containing various datasets used in this workshop
- You should have an exercises/ directory containing the source code solutions to the exercises
Part I
Overview of R
Outline
The language
Parallel R
Database access
Working in R
- To get help for a function name do: ?functionname
- If you don’t know the name of the function, use: help.search("blah")
- R Site Search and Rseek are very helpful online resources
- To exit from the prompt do: quit()
- Two ways to run R code
  - Type or paste it at the prompt
  - Execute code from a source file: source("mycode.R")
- Saving your work
  - save.image(file="mywork.Rda") saves the whole workspace to mywork.Rda, which is a binary file
  - When you restart R, load("mywork.Rda") will restore your workspace
  - You can save individual (or multiple) objects using save
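As a minimal sketch of saving individual objects (the temporary file and the object names are placeholders chosen for illustration), the following saves two objects, removes them, and restores them with load, which returns the names of the objects it restored:

```r
# Sketch: save two objects to a temporary file and restore them.
x <- 1:5
y <- "hello"
f <- tempfile(fileext = ".Rda")   # hypothetical file location
save(x, y, file = f)              # save selected objects only
rm(x, y)                          # remove them from the workspace
restored <- load(f)               # load() returns the names it restored
print(restored)
```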
Basic types in R
Primitive types
- numeric - indicates an integer or floating point number
- character - an alphanumeric symbol (string)
- logical - boolean value. Possible values are TRUE and FALSE
Complex types
- vector - a 1D collection of objects of the same type
- list - a 1D container that can contain arbitrary objects. Each element can have a name associated with it
- matrix - a 2D data structure containing objects of the same type
- data.frame - a generalization of the matrix type. Each column can be of a different type
Primitive types
> x <- 1.2
> x <- 2
> y <- "abcdegf"
> y
[1] "abcdegf"
> z <- c(1,2,3,4,5)
> z
[1] 1 2 3 4 5
> z <- 1:5
> z
[1] 1 2 3 4 5
Matrices
- Make a matrix from a vector
- The matrix is constructed column-wise
> z <- c(1,2,3,4,5,6,7,8,9,10)
> m <- matrix(z, nrow=2, ncol=5)
> m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
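To make the column-wise fill order concrete, this small sketch builds the same values both ways; the byrow argument of matrix switches to row-wise filling. The variable names are illustrative only.

```r
z <- 1:10
m_col <- matrix(z, nrow = 2, ncol = 5)                # default: fill by column
m_row <- matrix(z, nrow = 2, ncol = 5, byrow = TRUE)  # fill by row instead
print(m_col[1, ])   # first row when filled column-wise
print(m_row[1, ])   # first row when filled row-wise
```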
data.frame
- We want to store compound names, the number of atoms, and whether they are toxic or not
- Need to store: characters, numbers and boolean values
x <- c("aspirin", "potassium cyanide",
       "penicillin", "sodium hydroxide")
y <- c(21, 3, 41, 3)
z <- c(FALSE, TRUE, FALSE, TRUE)
d <- data.frame(x,y,z)
> d
                  x  y     z
1           aspirin 21 FALSE
2 potassium cyanide  3  TRUE
3        penicillin 41 FALSE
4  sodium hydroxide  3  TRUE
- But what do the columns mean? 6 months later, we probably won’t know (easily)
- So we should add names
> names(d) <- c("Name", "NumAtom", "Toxic")
> d
               Name NumAtom Toxic
1           aspirin      21 FALSE
2 potassium cyanide       3  TRUE
3        penicillin      41 FALSE
4  sodium hydroxide       3  TRUE
- When adding column names, don’t use the underscore character
- You can access a column of a data.frame by its name: d$Toxic
Lists
- A 1D collection of arbitrary objects (ArrayList in Java)
- We can access the lists by indexing: mylist[1] or mylist[[1]] for the first element or mylist[c(1,2,3)] for the first 3 elements
x <- 1.0
y <- "hello there"
s <- c(1,2,3)
mylist <- list(x,y,s)
> mylist
[[1]]
[1] 1
[[2]]
[1] "hello there"
[[3]]
[1] 1 2 3
- It’s useful to name individual elements of a list
x <- "oxygen"
z <- 8
m <- 32
mylist <- list(name="oxygen",
atomicNumber=z,
molWeight=m)
> mylist
$name
[1] "oxygen"
$atomicNumber
[1] 8
$molWeight
[1] 32
- We can then get the molecular weight by writing
> mylist$molWeight
[1] 32
Identifying types
- It’s useful, initially, to identify the type of the object
- Usually just printing it out explains what it is
- You can be more concise by doing
> x <- 1
> class(x)
[1] "numeric"
> x <- "hello world"
> class(x)
[1] "character"
> x <- matrix(c(1,2,3,4), nrow=2)
> class(x)
[1] "matrix"
Indexing
- Indexing is fundamental to using R and is applicable to vectors, matrices, data.frames and lists
- All indices start from 1 (not 0!)
- For a 1D vector, we get the i’th element by x[i]
- For a 2D matrix or data.frame we get the i, j element by x[i,j]
- But we can also index using vectors
  - Called an index vector
  - Allows us to select or exclude multiple elements simultaneously
- Say we have a vector x and a matrix y
> x
[1] 1 2 3 4 5 6 7 8 9 10
> y
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    4    7   10   13
[2,]    2    5    8   11   14
[3,]    3    6    9   12   15
- Some ways to get elements from x
  - 5th element → x[5]
  - The 3rd, 4th and 5th elements → x[ c(3,4,5) ]
  - Everything but the 3rd, 4th and 5th elements → x[ -c(3,4,5) ]
- Some ways to get elements from the matrix y
  - Element in the first row, second column → y[1,2]
  - All elements in the 1st column → y[,1]
  - All elements in the 3rd row → y[3,]
  - Elements in the first 2 columns → y[ , c(1,2) ]
  - Elements not in the first 2 columns → y[ , -c(1,2) ]
- You can also use a boolean vector to index another vector
- In this case, the index vector must be of the same length as your target vector
- If the i’th element in the index vector is TRUE then the i’th element of the target is selected
> x <- c(1,2,3,4)
> idx <- c(TRUE, FALSE, FALSE, TRUE)
> x[ idx ]
[1] 1 4
- It’s a pain to have to create a whole index vector by hand
- A more intuitive way to do this is to make use of R’s vector operations
- x == 2 will compare each element of x with the value 2 and return TRUE or FALSE
- The result is a boolean vector
> x <- c(1,2,3,4)
> x == 2
[1] FALSE TRUE FALSE FALSE
- So we can select the elements of x that are equal to 2 by doing
> x <- c(1,2,3,4)
> x[ x == 2 ]
[1] 2
- Or select elements of x that are greater than 2
> x <- c(1,2,3,4)
> x[ x > 2 ]
[1] 3 4
- Or select elements of x that are 2 or 4
> x <- c(1,2,3,4)
> x[ x == 2 | x == 4 ]
[1] 2 4
- The single bar (|) is intentional
- Logical operations using vectors should use |, & rather than the usual ||, &&
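A short sketch of the distinction, using made-up values: the single-character operators work element by element and so produce an index vector, while the doubled forms are intended for single values, typically inside an if condition.

```r
x <- c(1, 2, 3, 4)
elementwise <- x > 1 & x < 4     # one logical value per element
print(elementwise)
print(x[elementwise])
# && and || are meant for single values, e.g. inside if () conditions
n <- length(x)
if (n > 0 && x[1] == 1) print("non-empty and starts with 1")
```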
List indexing
- With list objects we can index using [ or [[
  - [ keeps element names, [[ drops names
  - [[ cannot select several elements with an index vector or a boolean vector
> y <- list(a='b', b=123, x=c(1,2,3))
> y[3]
$x
[1] 1 2 3
> y[[3]]
[1] 1 2 3
> y[[1:2]]
Error in y[[1:2]] : subscript out of bounds
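The difference between the two bracket forms can be checked directly. This sketch reuses the list from the slide and indexes it by name; the result variable names are illustrative.

```r
y <- list(a = "b", b = 123, x = c(1, 2, 3))
sub <- y["x"]        # single bracket: a list of length 1, name kept
elt <- y[["x"]]      # double bracket: the element itself
print(class(sub))
print(class(elt))
first_two <- y[1:2]  # index vectors work with single brackets
print(names(first_two))
```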
Subsetting
- Get a subset of the rows of a matrix or data.frame based on values in a certain column
- We can generate a boolean index vector
- For data.frames, we can use a more intuitive approach
d <- data.frame(a=1:10, b=runif(10), c=rnorm(10))
d[d$a <= 6,]
subset(d, a <= 6)
subset(d, a >= 7 & b < 0.4)
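As a sketch of how the two approaches relate (the seed is arbitrary and only makes the random columns reproducible), boolean indexing and subset select the same rows; subset can additionally pick columns through its select argument.

```r
set.seed(1)  # arbitrary seed so the random columns are reproducible
d <- data.frame(a = 1:10, b = runif(10), c = rnorm(10))
by_index  <- d[d$a >= 7 & d$b < 0.4, ]
by_subset <- subset(d, a >= 7 & b < 0.4)
print(all.equal(by_index, by_subset))
# subset() can also pick columns
print(names(subset(d, a <= 6, select = c(a, b))))
```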
Merging data.frames
- Lets you join data.frames based on a common column
- Pretty much identical to an SQL join (and supports natural and outer joins)
- Very handy when merging data from multiple sources
d1 <- data.frame(mol=I(c("mol1", "mol2", "mol3")),
klass=c("active", "active", "inactive"))
d2 <- data.frame(molecule=I(c("mol1", "mol2", "mol3")),
TPSA=c(1.2,3.4,5.6), Kier1=c(5.7, 8.9, 12))
> merge(d1, d2, by.x="mol", by.y="molecule")
   mol    klass TPSA Kier1
1 mol1   active  1.2   5.7
2 mol2   active  3.4   8.9
3 mol3 inactive  5.6  12.0
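For the outer join case, a minimal sketch with made-up tables in which the two sources only partly overlap: the default merge keeps matching rows only, while all = TRUE keeps every row and fills the gaps with NA.

```r
d1 <- data.frame(mol = c("mol1", "mol2", "mol3"),
                 klass = c("active", "active", "inactive"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(molecule = c("mol2", "mol3", "mol4"),
                 TPSA = c(3.4, 5.6, 7.8),
                 stringsAsFactors = FALSE)
inner <- merge(d1, d2, by.x = "mol", by.y = "molecule")              # matching rows only
outer <- merge(d1, d2, by.x = "mol", by.y = "molecule", all = TRUE)  # keep everything
print(inner)
print(outer)
```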
Functions in R
- Functions are much like in any other language
- R employs copy-on-write semantics
  - Send in a copy of an input variable if it will be modified in the function
  - Otherwise send in a reference
- Typing is dynamic
my.add <- function(x,y) {
return(x+y)
}
- You can then call the function as my.add(1,2)
- In general, check for the type of object using class()
- Functions can be anonymous
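Two of the points above can be seen in a few lines. In this sketch (the function name add.one is hypothetical), modifying an argument inside a function leaves the caller's object untouched, and a function can be defined and called without ever being named.

```r
add.one <- function(v) {
  v[1] <- v[1] + 1   # modifies a local copy only
  v
}
x <- c(1, 2, 3)
y <- add.one(x)
print(x)   # the caller's vector is unchanged
print(y)
# an anonymous function, defined and called in place
z <- (function(a, b) a * b)(3, 4)
print(z)
```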
Loop constructs
- R does not have C-style looping
for (i = 0; i < 10; i++) {
print(i)
}
- Rather, it works like Python’s for-in idiom
for (i in c(0,1,2,3,4,5,6,7,8,9)) {
print(i)
}
- In general, to loop n times, you create a vector with n elements, say by writing 1:n
- But like Python loops, you can just loop over elements of a vector or list and use them
> objects <- list(x=1, y="hello world", z=c(1,2,3))
> for (i in objects) print(i)
[1] 1
[1] "hello world"
[1] 1 2 3
- Good R style discourages explicit loops
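One caveat with the 1:n idiom is worth a small sketch: when n is 0, 1:n counts down from 1 to 0, so the loop body runs twice, whereas seq_len(n) yields an empty vector. The counter names are illustrative.

```r
n <- 0
count_colon <- 0
for (i in 1:n) count_colon <- count_colon + 1      # 1:0 is c(1, 0): runs twice
count_seq <- 0
for (i in seq_len(n)) count_seq <- count_seq + 1   # seq_len(0) is empty: never runs
print(c(count_colon, count_seq))
```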
Apply and friends
- Explicit loops (for (...)) are not idiomatic and can be slow
- apply, lapply, sapply resemble functional programming styles (but see Reduce, Filter and Map for alternatives)
- Good idea to master these forms
x <- c(1,2,3,4,5)
sapply(x, sqrt)
sapply(x, function(z) z+3)
sapply(x, function(z) rep(z,3), simplify=FALSE)
sapply(x, function(z) rep(z,3), simplify=TRUE)
- Handy for looping over vector elements
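The effect of the simplify argument is easiest to see by inspecting the results. In this sketch, with a shorter input vector than on the slide, simplify = FALSE returns a list and simplify = TRUE collapses equal-length results into a matrix with one column per input element.

```r
x <- c(1, 2, 3)
as_list   <- sapply(x, function(z) rep(z, 3), simplify = FALSE)
as_matrix <- sapply(x, function(z) rep(z, 3), simplify = TRUE)
print(class(as_list))
print(dim(as_matrix))   # one column per element of x
```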
Looping over lists
- For list objects, use lapply
- The return value is a list - if it can be converted to a simple vector use unlist to do so
x <- list(x=1, b=2, c=3)
lapply(x, sqrt)
unlist(lapply(x, sqrt))
lapply(x, function(z) {
return(z^2)
})
- The function argument of lapply should expect an object as the first argument
- A data.frame is a list of columns, so lapply can be used to loop over columns
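Looping over the columns of a data.frame might look like the following sketch, which reports the class of each column of a small made-up table:

```r
d <- data.frame(Name = c("aspirin", "penicillin"),
                NumAtom = c(21, 41),
                Toxic = c(FALSE, FALSE),
                stringsAsFactors = FALSE)
col.types <- unlist(lapply(d, class))   # one entry per column, named by column
print(col.types)
```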
Looping over multiple objects
- What if you want to loop over two (or more) lists (or vectors) simultaneously?
- Like Python’s zip would allow
- Use mapply and provide a function that takes (at least) as many arguments as input vectors/lists
x <- 1:3
y <- list(a="Hello", b=12.34, c=c(1,2,3))
mapply(function(a,b) {
print(a)
print(b)
print("---")
}, x, y)
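mapply also collects return values, which is often more useful than printing. A minimal sketch with hypothetical values pairs each label with a number:

```r
heavy.atoms <- c(13, 3, 24)      # hypothetical values, for illustration only
labels <- c("mol1", "mol2", "mol3")
res <- mapply(function(n, lab) paste(lab, n, sep = ":"), heavy.atoms, labels)
print(unname(res))
```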
Looping over matrices
- Use apply to operate on entire rows or columns of a matrix or data.frame
- Be careful when working on rows of a data.frame - behavior can be unexpected if all columns are not of the same type
m <- matrix(1:12, nrow=4)
apply(m, 1, sum) ## sum the rows
apply(m, 2, sum) ## sum the columns
- The function argument of apply should expect a vector as the first argument
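The data.frame caveat can be illustrated with a small made-up table: apply converts the data.frame to a matrix first, so with mixed column types every row reaches the function as a character vector. Restricting the call to the numeric columns avoids this.

```r
d <- data.frame(Name = c("aspirin", "penicillin"),
                NumAtom = c(21, 41),
                stringsAsFactors = FALSE)
row.classes <- apply(d, 1, function(r) class(r))
print(row.classes)   # each row arrives as a character vector
num.sums <- apply(d[, "NumAtom", drop = FALSE], 1, sum)
print(num.sums)
```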
Looping caveats
- The main thing to remember when using apply, lapply etc. is the nature of the inputs to the user-supplied function
- apply will send a vector (corresponding to a row or column)
- lapply and sapply will send a single element of the list/vector
- mapply will send a single element of each of the input lists/vectors
- The function argument of these methods can take anonymous functions or pre-defined functions
Loading data
- Writing vectors and matrices by hand is good for 15 minutes!
- Easier to read data from files
- R supports a variety of text and binary formats
- Certain packages can be loaded to handle special formats
  - Read from SPSS, Stata, SAS
  - Read microarray data, GIS data, chemical structure data
- Excel files can also be read (easiest on Windows)
- Data can be accessed from SQL databases (Postgres, MySQL, Oracle)
Reading CSV data
- Consider the extract of a CSV file below
ID,Class,Desc1,Desc2,Desc3,Desc4,Desc5
w150,nontoxic,0.37,0.44,0.86,0.44,0.77
c49,toxic,0.37,0.58,0.05,0.85,0.57
s97,toxic,0.66,0.63,0.93,0.69,0.08
- Key features include
  - A header line, naming the columns
  - An ID column and a Class column, both characters
- Reading this data file is as simple as
> dat <- read.csv("data1.csv", header=TRUE)
- If you print dat it will scroll down your screen
- To get a quick summary of the data.frame use the str() function
> str(dat)
'data.frame': 100 obs. of 7 variables:
$ ID : Factor w/ 100 levels "a115","a180",..: 83 7 66 85 64 82 33 10 92 95 ...
$ Class: Factor w/ 2 levels "nontoxic","toxic": 1 2 2 1 2 1 1 1 1 2 ...
$ Desc1: num 0.37 0.37 0.66 0.18 0.88 0.37 0.72 0.72 0.2 0.68 ...
$ Desc2: num 0.44 0.58 0.63 0.09 0.1 0.14 0.03 0.15 0.65 0.66 ...
$ Desc3: num 0.86 0.05 0.93 0.57 0.85 0.67 0.99 0.68 0.77 0.78 ...
$ Desc4: num 0.44 0.85 0.69 0.85 0.58 0.17 0.95 0.19 0.69 0.61 ...
$ Desc5: num 0.77 0.57 0.08 0.17 0.74 0.56 0.22 0.19 0.91 0.13 ...
What is a factor?
- The factor type is used to represent categorical variables
  - Toxic, non-toxic
  - Active, inactive
  - Yes, no, maybe
- They correspond to enums in C/Java
- A factor variable looks like an ordinary vector of characters
  - Internally they are represented by integers
- A factor variable has a property called the levels
  - Indicates what values are valid for the factor
> levels(dat$Class)
[1] "nontoxic" "toxic"
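The integer representation can be inspected directly. This sketch builds a small factor by hand (the values are made up) and shows its levels, the underlying integer codes, and the per-level counts.

```r
klass <- factor(c("toxic", "nontoxic", "toxic"))
print(levels(klass))        # levels are sorted alphabetically by default
print(as.integer(klass))    # the underlying integer codes
print(table(klass))         # counts per level
```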
- If you try to assign an invalid value to a factor you get a warning
> dat$Class[1] <- "Yes"
Warning message:
In `[<-.factor`(`*tmp*`, 1, value = "Yes") :
  invalid factor level, NAs generated
> dat$Class[1:10]
[1] <NA> toxic toxic nontoxic toxic nontoxic nontoxic nontoxic
[9] nontoxic toxic
Levels: nontoxic toxic
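If a new value is genuinely needed, one way to avoid the NA is to extend the levels before assigning. A minimal sketch on a hand-made factor (the level name "unknown" is purely illustrative):

```r
klass <- factor(c("toxic", "nontoxic", "toxic"))
levels(klass) <- c(levels(klass), "unknown")   # extend the set of valid values
klass[1] <- "unknown"                          # now allowed, no NA is generated
print(klass)
```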
Operating on groups
- Factors are very useful when you’d like to operate on subsets of the data, as defined by factor levels
- Assay data might label molecules as active, inactive, inconclusive
- by is similar to lapply but operates on data.frames in a factor-wise way
- Takes a function argument which will receive a subset of the input data.frame, corresponding to a single level of the factor
assay <- read.csv("data/aid2170.csv", header=TRUE)
str(assay)
levels(assay$PUBCHEM.ACTIVITY.OUTCOME)
## how many actives and inactives are there?
by(assay, assay$PUBCHEM.ACTIVITY.OUTCOME, function(x) {
nrow(x)
})
## summary stats of the inhibition values, by group
by(assay, assay$PUBCHEM.ACTIVITY.OUTCOME, function(x) {
summary(x$Inhibition)
})
## or distributions of the inhibition values, by group
par(mfrow=c(1,2))
by(assay, assay$PUBCHEM.ACTIVITY.OUTCOME, function(x) {
hist(x$Inhibition)
})
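The same pattern can be tried without the workshop data file. This sketch uses a toy stand-in table whose column names and values are invented for illustration, and computes per-group counts and means with by.

```r
# toy stand-in for an assay table; all values are made up for illustration
toy <- data.frame(outcome = factor(c("active", "inactive", "active",
                                     "inactive", "inactive")),
                  inhibition = c(80, 5, 90, 10, 0))
counts <- by(toy, toy$outcome, nrow)
means  <- by(toy, toy$outcome, function(x) mean(x$inhibition))
print(counts)
print(means)
```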
Parallel R
- R itself is not multi-threaded
  - Well suited for embarrassingly parallel problems
- Even then, a number of “large data” problems are not tractable
  - Recent developments on integrating R and Hadoop address this
  - See the RHIPE package
- We’ll look at snow, which allows distribution of processing on the same machine (multiple CPUs) or multiple machines
  - But see snowfall for a nice set of wrappers around snow
- Also see multicore for a package that focuses on parallel processing on multicore CPUs
Trivial parallelization with snow
- snow lets you use multiple machines or multiple cores on a single machine
- Well suited for data-parallel problems
- If using multiple machines, data will be sent across the network, so large chunks of data can slow things down
- General strategy is
  - Initialize cluster
  - Send variables required for calculation to the nodes
  - Call a parallel function
Use case - exhaustive feature selection
- Given a set of N descriptors, find an n-descriptor subset that leads to a good predictive model
- Will obviously explode - 50C5 = 2,118,760 descriptor combinations
- Usually we would employ a genetic algorithm or simulated annealing or even some form of stepwise selection
- For pedagogical purposes
  - how can we do this exhaustively?
  - how can we speed it up?
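The size of the search space can be checked with choose. This sketch computes the count quoted above and, for comparison, the number of 4-descriptor subsets of 34 descriptors; the variable names are illustrative.

```r
n.50.5 <- choose(50, 5)   # 5-descriptor subsets of 50 descriptors
n.34.4 <- choose(34, 4)   # 4-descriptor subsets of 34 descriptors
print(n.50.5)
print(n.34.4)
```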
The data and a model
- Let’s consider some melting point data
- Precalculated and reduced descriptor set
- 184 molecules in training set, 34 descriptors
- Goal - find a good 4 descriptor OLS model
Bergstrom, C.A.S. et al, J. Chem. Inf. Model., 2003, 43, 1177–1185
A serial solution
- Evaluate all 4-member combinations of the numbers {1, 2, ..., 34}
- These are the indices of the descriptor columns
- Loop over the rows, use the values as column indices of the descriptor matrix, build a model
## First some data prep
library(gtools)
load("data/bergstrom/bergstrom.Rda")
depv <- train.desc[,1]
desc <- train.desc[,-1]
idx <- 1:nrow(desc)
combos <- combinations(ncol(desc), 4)
rmse <- function(x,y) sqrt(sum((x-y)^2)/length(x))
system.time(
rmse.serial <- apply(combos, 1, function(idx, mydata, y) {
dat <- data.frame(y=y, x=mydata[,idx])
m <- lm(y~., dat)
return(rmse(m$fitted,y))
}, mydata=desc, y=depv)
)
- Takes about 200 seconds
The parallel solution
- Code is pretty much identical
- First need to initialize the “cluster” - in this case we’ll run multiple processes on the same machine
library(snow)
clus <- makeCluster(rep("localhost",2), type="SOCK")
clusterExport(clus, "rmse") ## the workers need the rmse helper
system.time(
  rmse.par <- parApply(clus, combos, 1, function(idx, mydata, y) {
    dat <- data.frame(y=y, x=mydata[,idx])
    m <- lm(y~., dat)
    return(rmse(m$fitted,y))
  }, mydata=desc, y=depv)
)
stopCluster(clus) ## shut down the worker processes
- Takes about 120 seconds, but easily scales up with multiple cores
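The same pattern can be tried without the melting point data. The sketch below is an assumption-laden stand-in: it uses the parallel package that ships with current R rather than snow, random made-up descriptors, and 2-descriptor models, and it simply checks that the parallel and serial results agree. The helper name fit.rmse is hypothetical.

```r
library(parallel)   # assumption: base R's parallel package stands in for snow
rmse <- function(x, y) sqrt(sum((x - y)^2) / length(x))
set.seed(42)                                         # arbitrary seed
desc <- as.data.frame(matrix(rnorm(200), ncol = 5))  # 40 rows, 5 made-up descriptors
depv <- rnorm(40)                                    # made-up dependent variable
combos <- t(combn(ncol(desc), 2))                    # one combination per row
fit.rmse <- function(idx, mydata, y) {
  dat <- data.frame(y = y, x = mydata[, idx])
  m <- lm(y ~ ., dat)
  rmse(fitted(m), y)
}
clus <- makeCluster(2)                               # two local worker processes
clusterExport(clus, "rmse", envir = environment())   # send the helper to the workers
rmse.par <- parApply(clus, combos, 1, fit.rmse, mydata = desc, y = depv)
stopCluster(clus)                                    # always shut the workers down
rmse.serial <- apply(combos, 1, fit.rmse, mydata = desc, y = depv)
print(all.equal(rmse.par, rmse.serial))
```

With so little data the parallel version will not be faster; the point is only the sequence of steps: create the cluster, export what the workers need, call the parallel function, stop the cluster.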
R and databases
- Bindings to a variety of databases are available
  - Mainly RDBMSs but some NoSQL databases are being interfaced
- The R DBI spec lets you write code that is portable over databases
- Note that loading multiple database packages can lead to problems
- This can happen even when you don’t explicitly load a database package
  - Some Bioconductor packages load the RSQLite package as a dependency, which can interfere with, say, ROracle
A quick look at RSQLite
- Not suitable for very large data or multiple connections
- Definitely look at sqldf, which makes it very easy to use SQL on R data.frames
- Very brief overview of what we can do with RSQLite for local storage
- Should transfer easily to other databases
Basic usage - connecting
- We’ll use a database I prepared and placed at data/db/assay.db
- Let’s first get some information about the tables and field definitions
library(RSQLite)
## specific to RSQLite
sqlite <- dbDriver("SQLite")
con <- dbConnect(sqlite, dbname = "data/db/assay.db")
> dbListTables(con)
[1] "assay"
> dbListFields(con, "assay")
[1] "row_names" "aid" "PUBCHEM_SID" "PUBCHEM_CID"
"PUBCHEM_ACTIVITY_OUTCOME" "PUBCHEM_ACTIVITY_SCORE"
"PUBCHEM_ASSAYDATA_COMMENT"
Basic usage - querying
- The simplest way to get data is to slurp in the whole table as a data.frame
- Not a good idea if the table is very large
assay <- dbReadTable(con, "assay")
> str(assay)
'data.frame': 188264 obs. of 6 variables:
$ aid : num 396 396 396 429 429 429 429 429 429 429 ...
$ PUBCHEM_SID : int 10318951 10318962 10319004 842121 842122 842123 842124 842125 842126 842127 ...
$ PUBCHEM_CID : int 6419748 5280343 969516 6603008 6602571 6602616 644371 6603132 2850911 6603374 ...
$ PUBCHEM_ACTIVITY_OUTCOME : chr "active" "active" "active" "inactive" ...
$ PUBCHEM_ACTIVITY_SCORE : int 100 100 100 0 1 1 1 -3 0 0 ...
$ PUBCHEM_ASSAYDATA_COMMENT: int 20061024 20060421 20061024 20061127 20061127 20061127 20061127 20061127 20060622 20061127 ...
- Better to get subsets of the table via SQL queries
- Two ways to do it - get back all rows from the query or retrieve in chunks
- The latter is better when you might get a lot of rows
## query and get results in one go
dbGetQuery(con, "select distinct aid from assay")
## query and then get results in chunks
res <- dbSendQuery(con, "select * from assay where aid = 429")
fetch(res, n=10) ## get first 10 results
fetch(res, n=10) ## get next 10
fetch(res, n=-1) ## get remaining results
## no more rows to fetch
fetch(res, n=-1)
dbClearResult(res) ## a vital step!
Saving data in a database
- Can directly write a data.frame to a database file
- RSQLite will automatically generate a CREATE TABLE statement
library(rpubchem)
## get some assay data from PubChem
assay <- sapply(c(396, 429, 438, 445), function(x) {
data.frame(aid=x, get.assay(x))[,1:6]
}, simplify=FALSE)
## join all assays into a single data.frame
assay <- do.call("rbind", assay)
## Make a new table
sqlite <- dbDriver("SQLite")
con <- dbConnect(sqlite, dbname = "assay.db")
dbWriteTable(con, "assay", assay)
dbDisconnect(con)
So why use a database?
- Don’t have to load bulk CSV or .Rda files each time we start work
- Can index data in RDBMSs so queries can be very fast
- Good way to exchange data between applications (as opposed to .Rda files which are only useful between R users)
- All the code above, except for the dbConnect statement, will be identical for PostgreSQL, MySQL etc.
Part II
The Chemistry Development Kit
Outline
Introduction
I/O
Fingerprints
Substructures
Resources
- The CDK wiki
- The nightly build page
- Code snippets
- cdk-user mailing list
- Keyword list and feature list
- If you want to develop with the CDK, use the comprehensive jar file that contains all dependencies
  - from the nightly build page for the cutting edge
  - or from Sourceforge for the last stable release
Some CDK issues
- Learning the CDK is non-trivial
- No real tutorial but see here for a start
- The best approach is to use the keyword list to find the relevant class and then look at the unit tests for that class
What does the CDK provide?
- Fundamental chemical objects
  - atoms
  - bonds
  - molecules
- More complex objects are also available
  - Sequences
  - Reactions
  - Collections of molecules
- Input/Output for a wide variety of molecular file formats
- Fingerprints and fragment generation
- Rigid alignments, pharmacophore searching
- Substructure searching, SMARTS support
- Molecular descriptors
Some design principles
- Abstraction is the key principle
  - Applies to chemical objects such as atoms, bonds
  - Also applied to input/output
- Rather than directly creating an atom object we use a factory method
IAtom atom = DefaultChemObjectBuilder.getInstance().
newAtom();
and not
IAtom atom = new Atom();
- By using DefaultChemObjectBuilder we don’t worry how an atom or bond is created
- By using this type of abstraction we can load any file format without having to specify it
- Another aspect is the use of interfaces rather than concrete types
- A molecule is a collection of atoms and bonds
  - A graph view is not imposed directly
  - Molecular features such as aromaticity are properties of the atoms and bonds
Molecules
- A molecule is fundamentally represented as an IAtomContainer object
  - For many purposes this is fine
  - In some cases specialization is required so we have
    - Crystal
    - Ring
    - Fragment
- Some methods require a specific subclass
- Many methods simply require an IAtomContainer, so you can use any of the above subclasses
- Similar procedure to getting an atom or bond object
IAtomContainer container = DefaultChemObjectBuilder.
getInstance().
newAtomContainer();
- Then we populate it
container.addAtom( atom1 );
container.addAtom( atom2 );
container.addBond( atom1, atom2, 1.5);
Input/Output
- SMILES are a common format
- When parsing SMILES, you will have to handle the InvalidSmilesException
SmilesParser sp = new SmilesParser(
DefaultChemObjectBuilder.
getInstance());
IMolecule molecule = sp.parseSmiles("CCCC(CC)CC=CC=CC");
- Given a molecule object, we can create a SMILES
SmilesGenerator sg = new SmilesGenerator();
String smiles = sg.createSMILES(molecule);
System.out.println("SMILES = "+smiles);
- Reading files from disk is a common task
- A little more complex due to abstraction but is quite general
File file = new File("mols.sdf");
FileReader fileReader = new FileReader(file);
MDLV2000Reader reader = new MDLV2000Reader(fileReader);
IChemFile chemFile = (IChemFile) reader.
read(new ChemFile());
List containers = ChemFileManipulator.
getAllAtomContainers(chemFile);
Cheminformaticsin R
80/189
Introduction
I/O
Fingerprints
Substructures
Input/Output
I This is nice since it allows you to get all the molecules inthe file at one go
I Not a good idea for very large SD files
I In such cases, use IteratingMDLReader
I Allows you to iterate over arbitrarily large SD files
I There is also an IteratingSMILESReader for big SMILESfiles
Input/Output
- Writing files is also supported for a wide variety of formats
- For example, to write to SD format
FileWriter w1 = new FileWriter(new File("molecule.sdf"));
try {
MDLWriter mw = new MDLWriter(w1);
mw.write(molecule);
mw.close();
} catch (Exception e) {
System.out.println(e.toString());
}
SD tags
- SD tags are a common way to associate property information with a molecular structure
- See PubChem SD files to get an idea of what we can put in them
CDK 1/28/07,15:24
2 1 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
M END
> <myProperty>
1.2
> <anotherProperty>
Hello world
SD tags
- When a molecule is read in from an SD file we can get the value of a given tag by doing
Object value = molecule.getProperty("myProperty");
- When writing a molecule we can supply a set of tags in a HashMap
HashMap tags = new HashMap();
tags.put("myProperty", new Double(1.2));
SD tags
- To ensure tags are written to disk we would do
FileWriter w1 = new FileWriter(new File("molecule.sdf"));
try {
MDLWriter mw = new MDLWriter(w1);
mw.setSdFields( tags );
mw.write(molecule);
mw.close();
} catch (Exception e) {
System.out.println(e.toString());
}
SD tags
- A molecule might have several properties associated with it
- We can avoid creating an extra HashMap when writing them all to disk
MDLWriter mw = new MDLWriter(w1);
mw.setSdFields( molecule.getProperties() );
mw.write(molecule);
mw.close();
Generating canonical SMILES
- The SmilesGenerator class will create canonical SMILES
String smiles = "C(C)(C)CC=CC(C(CC(C))CC)CC";
SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());
IMolecule mol = sp.parseSmiles(smiles);
SmilesGenerator sg = new SmilesGenerator();
String canSmi = sg.createSMILES(mol);
Fingerprints
1 0 1 1 0 0 0 1 0
- Used in many areas: database screening, diversity analysis, modeling
Fingerprints
- The simplest fingerprint looks for specific substructures and sets a specific bit
  - MACCSFingerprinter
  - SubstructureFingerprinter
- Hashed fingerprints don't maintain a correspondence between bit position and substructural feature
Getting fingerprints
- The CDK can generate binary fingerprints
- This gives a default fingerprint of 1024 bits, but you can specify smaller or larger fingerprints
- The CDK also provides variants of the default fingerprint
  - MACCSFingerprinter
  - ExtendedFingerprinter - includes bits related to rings
  - GraphOnlyFingerprinter - simplified version of the fingerprinter that ignores bond order
  - SubstructureFingerprinter - generates fingerprints based on a specified set of substructures
IFingerprinter fprinter = new Fingerprinter();
BitSet fingerprint = fprinter.getFingerprint(molecule);
Evaluating similarity
- Given two molecules we are interested in determining how similar they are
- A common approach is to evaluate their fingerprints
- Calculate a similarity metric (e.g., Tanimoto coefficient)
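// getFingerprint is an instance method, so create a fingerprinter first
IFingerprinter fprinter = new Fingerprinter();
BitSet fp1 = fprinter.getFingerprint(molecule1);
BitSet fp2 = fprinter.getFingerprint(molecule2);
float tc = Tanimoto.calculate(fp1, fp2);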
Substructure searching
- Identify the presence/absence of a substructure in a molecule
IAtomContainer mol = ...;
SMARTSQueryTool sqt = new SMARTSQueryTool("C");
sqt.setSmarts("C(=O)C");
boolean matched = sqt.matches(mol);
- In many cases we’re interested in all the possible matches
List<List<Integer>> maps;
maps = sqt.getUniqueMatchingAtoms();
- maps.size() gives you the number of matches
- Each element of maps is a sequence of atom IDs
Pharmacophores
- We’ll look at constructing a pharmacophore query by hand
- Generally, you’d read it from a file
- Given a query we can search for it in a target molecule
- The target molecule should have 3D coordinates
- Generally you’ll examine multiple conformers for a given molecule
Pharmacophores
- Construct pharmacophore groups
PharmacophoreQueryAtom o =
new PharmacophoreQueryAtom("D", "[!H0;#7,#8,#9]");
PharmacophoreQueryAtom n1 =
new PharmacophoreQueryAtom("A", "c1ccccc1");
PharmacophoreQueryAtom n2 =
new PharmacophoreQueryAtom("A", "c1ccccc1");
Pharmacophores
- Construct pharmacophore constraints
PharmacophoreQueryBond b1 =
new PharmacophoreQueryBond(o, n1, 4.0, 4.5);
PharmacophoreQueryBond b2 =
new PharmacophoreQueryBond(o, n2, 4.0, 5.0);
PharmacophoreQueryBond b3 =
new PharmacophoreQueryBond(n1, n2, 5.4, 5.8);
Pharmacophores
- Given pharmacophore groups and constraints, we construct a pharmacophore query
- Designed to have the same semantics as a normal molecule
QueryAtomContainer query =
new QueryAtomContainer();
query.addAtom(o);
query.addAtom(n1);
query.addAtom(n2);
query.addBond(b1);
query.addBond(b2);
query.addBond(b3);
Pharmacophores
- Given the query we can now perform a pharmacophore search
IAtomContainer aMolecule;
// load in the molecule
PharmacophoreMatcher matcher =
new PharmacophoreMatcher(query);
boolean status = matcher.matches(aMolecule);
if (status) {
// get the matching pharmacophore groups
List<List<PharmacophoreAtom>> pmatches =
matcher.getUniqueMatchingPharmacophoreAtoms();
}
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Part III
Cheminformatics & R - rcdk
Using the CDK in R
- Based on the rJava package
- Two R packages to install (not counting the dependencies)
- Provides access to a variety of CDK classes and methods
- Idiomatic R
[Figure: software stack within the R programming environment; components shown: rJava, CDK, Jmol, rcdk, XML, rpubchem, fingerprint]
Outline
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Reading in data
- The CDK supports a variety of file formats
- rcdk loads all recognized formats automatically
- Data can be local or remote
mols <- load.molecules( c("data/io/set1.sdf",
"data/io/set2.smi",
"http://rguha.net/rcdk/remote.sdf"))
- Gives you a list of Java references representing IAtomContainer objects
- Can’t do much with these objects, except via rcdk functions
Reading structures from text files
- SMILES strings can be contained in arbitrary text files (such as CSV)
- If you read such a file with read.table or read.csv, make sure to set comment.char=""
- Also a good idea to specify the colClasses argument or else as.is=TRUE, otherwise SMILES will be read in as factors
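To see why the comment character matters: in SMILES, # denotes a triple bond, while read.table treats # as the start of a comment by default. The sketch below uses a small made-up CSV held in memory (the column names and molecules are illustrative, not from the workshop data) and base R only.

```r
# A tiny CSV held in memory; the second SMILES contains '#' (a triple bond).
csv_text <- "id,smiles
1,CCO
2,CC#N
3,c1ccccc1"

# Default read.table settings: '#' starts a comment, so 'CC#N' is cut short.
truncated <- read.table(text = csv_text, header = TRUE, sep = ",")

# Settings recommended above: no comment character, strings kept as-is.
intact <- read.table(text = csv_text, header = TRUE, sep = ",",
                     comment.char = "", as.is = TRUE)

print(as.character(truncated$smiles))
print(intact$smiles)
```

The character vector in intact$smiles can then be handed to a SMILES parser without any factor-to-character conversion.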
Reading SMILES strings
- Also useful to load molecules directly from a SMILES string
- parse.smiles parses a single molecule
mol <- parse.smiles("c1ccccc1C(=O)N")
## loop over multiple SMILES
smiles <- c("CCC", "c1ccccc1",
"C(C)(C=O)C(CCNC)C1CC1C(=O)")
mols <- sapply(smiles, parse.smiles)
- But the second call is inefficient, especially for large SMILES collections
Efficient SMILES parsing
- By default parse.smiles instantiates a SMILES parser object (SmilesParser)
- So when looping over a vector of SMILES strings we’re creating a parser object each time
- In such a case, specify a parser object explicitly for a roughly 2X speedup
smilesParser <- get.smiles.parser()
## now specify this parser
smiles <- c("CCC","c1ccccc1","C(C)(C=O)C(CCNC)C1CC1C(=O)")
mols <- sapply(smiles, parse.smiles, parser = smilesParser)
Efficient SMILES parsing
> parser <- get.smiles.parser()
> smiles <- rep(c("CCC",
"c1ccccc1",
"C(C)(C=O)C(CCNC)C1CC1C(=O)"),
500)
> system.time(junk <- sapply(smiles, parse.smiles))
user system elapsed
4.088 0.106 3.946
> system.time(junk <- sapply(smiles, parse.smiles, parser=parser))
user system elapsed
1.792 0.032 1.705
You get what is provided . . . and no more
- CDK philosophy is not to do extra work
- For file reading this implies that it only parses what is specified in the file and will not perform operations that are not explicitly indicated
- As a result some features (isotopic masses) of molecules and atoms are not automatically configured
- load.molecules will do these extra steps
- parse.smiles will not
Configuring molecules
## no explicit configuration needed
mols <- load.molecules("data/io/bp.smi")
get.exact.mass(mols[[1]])
## explicit configuration required
mol <- parse.smiles("c1ccccc1")
get.exact.mass(mol)
- The last call will give a NullPointerException since the isotopic masses have not been configured
Configuring molecules
- When parsing SMILES, do the configuration by hand
smi <- "OC(=O)C1=C(C=CC=C1)OC(=O)C"
mol <- parse.smiles(smi)
## do the configuration
do.typing(mol) ## not required in recent versions
do.aromaticity(mol) ## not required in recent versions
do.isotopes(mol)
## now, this works
get.exact.mass(mol)
Hydrogens - implicit & explicit
- SMILES indicates that when hydrogens are not specified, they are to be considered implicit
- But MDL MOL files have no such specification, so if no H’s are specified, the molecule will not have explicit or implicit hydrogens
mol <- load.molecules("data/noh.mol")[[1]]
unlist(lapply(get.atoms(mol), get.symbol))
- In such cases, convert.implicit.to.explicit will add implicit H’s (to satisfy valencies) and then convert them to explicit
convert.implicit.to.explicit(mol)
unlist(lapply(get.atoms(mol), get.symbol))
A note on function calls
- R uses copy-on-write semantics for function calls
  - If a function modifies its input argument, R will send a copy of the input variable to the function
  - Otherwise send a reference
  - “Safety” of call-by-value, efficiency of call-by-reference
- In contrast, convert.implicit.to.explicit modifies the input argument
- No return value
- In general rJava allows you to work with Java objects using call-by-reference semantics
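The difference between the two calling conventions can be sketched in base R alone. The example below does not use rJava; it uses an R environment as a loose stand-in for a Java object reference, and the function names append_h and append_h_ref are made up for illustration.

```r
# Ordinary R values: the callee works on its own copy.
append_h <- function(symbols) {
  symbols <- c(symbols, "H")   # modifies only the local copy
  invisible(NULL)
}
atoms <- c("C", "O")
append_h(atoms)
length(atoms)                  # the caller's vector is untouched

# Environments have reference semantics, loosely like a Java object reference.
append_h_ref <- function(mol_env) {
  mol_env$symbols <- c(mol_env$symbols, "H")   # change is visible to the caller
  invisible(NULL)
}
mol <- new.env()
mol$symbols <- c("C", "O")
append_h_ref(mol)
length(mol$symbols)            # modified in place
```

This is why a function such as convert.implicit.to.explicit can change a molecule without returning it, whereas a function operating on a plain R vector or list cannot.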
Writing molecules
- Currently only SDF is supported as an output file format
- By default a multi-molecule SDF will be written
- Properties are not written out as SD tags by default
smis <- c("c1ccccc1", "CC(C=O)NCC", "CCCC")
mols <- sapply(smis, parse.smiles)
## all molecules in a single file
write.molecules(mols, filename="mols.sdf")
## ensure molecule data is written out
write.molecules(mols, filename="mols.sdf", write.props=TRUE)
## molecules in individual files
write.molecules(mols, filename="mols.sdf", together=FALSE)
Generating SMILES
- Finally, we can also create SMILES strings from molecules via get.smiles
- Works on a single molecule at a time
- Canonical SMILES are generated by default
smis <- c("c1ccccc1", "CC(C=O)NCC", "CCCC")
mols <- sapply(smis, parse.smiles)
lapply(mols, get.smiles)
Accessing attached data
- Some file formats allow you to attach data to a structure
- SDF is a good example, using named tags and associated values
- For example, a DrugBank SDF has fields such as
> <DRUGBANK_ID>
DB00135
> <DRUGBANK_GENERIC_NAME>
L-Tyrosine
> <DRUGBANK_MOLECULAR_FORMULA>
C9H11NO3
> <DRUGBANK_MOLECULAR_WEIGHT>
181.1885
Accessing data fields
- The CDK reads SDF tags and their values into the IAtomContainer object
- We can access them via the get.property and get.properties functions
> mols <- load.molecules("data/io/set1.sdf")
> get.properties(mols[[1]])
$DRUGBANK_ID
[1] "DB00135"
$DRUGBANK_IUPAC_NAME
[1] "(2S)-2-amino-3-(4-hydroxyphenyl)propanoic acid"
$`cdk:Title`
[1] "135"
...
Accessing data fields
- If you know the name of the field, just pull the value directly
get.property(mols[[1]], "DRUGBANK_GENERIC_NAME")
- Note that values for all tags are character, so you need to cast them to the appropriate types manually
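A minimal sketch of the manual cast, using a plain R list as a hypothetical stand-in for what get.properties returns (the two values are taken from the DrugBank fields shown earlier; no rcdk call is made here):

```r
# Hypothetical stand-in for the result of get.properties(): all values are character.
props <- list(DRUGBANK_ID = "DB00135",
              DRUGBANK_MOLECULAR_WEIGHT = "181.1885")

# Cast explicitly before doing arithmetic or modeling.
mw <- as.numeric(props$DRUGBANK_MOLECULAR_WEIGHT)
is.numeric(mw)
mw / 2
```

Note that as.numeric returns NA with a warning for strings that are not numbers, so it is worth checking for NA after casting a whole column of tag values.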
Extra tags
- In some cases, you’ll see tag names that are not in the SDF file
- The CDK uses molecule property fields to attach arbitrary data
- For example, the title field from an SDF entry is stored using the cdk:Title name
- In general, all CDK-derived properties have names prefixed by “cdk:”
## identify CDK-derived properties
props <- get.properties(mols[[1]])
props[ grep("^cdk:", names(props)) ]
Setting property values
- You can add your own property values
- The type of the value is preserved, as long as you are within R
- The type of the name (or key) must be character
mol <- parse.smiles("c1ccccc1Cc2ccccc2")
set.property(mol, "MyProperty", 23.14)
set.property(mol, "toxic", "yes")
get.properties(mol)
Persisting molecular data
- You could extract the properties and write them to a text file
mols <- load.molecules("data/io/set1.sdf")
props <- lapply(mols, get.properties)
## convert list to table and write it out
props <- do.call("rbind", props)
write.table(props, "drugdata.txt", row.names=FALSE)
Persisting molecular data
- Molecular data can also be persisted by writing to SDF format
mol <- parse.smiles("c1ccccc1")
set.property(mol, "foo", 1)
set.property(mol, "bar", "baz")
write.molecules(mol, "mol.sdf", write.props=TRUE)
Working with molecules
- Currently you can access atoms and bonds, and get certain atom properties and 2D/3D coordinates
- Since rcdk doesn’t cover the entire CDK API, you might need to drop down to the rJava level and make calls to the Java code by hand
- I’m happy to code specific tasks in idiomatic R
Working with molecules
- A useful task is to wash or standardize molecules
- CDK doesn’t provide a method to do this directly
- But a common operation is to check for disconnected molecules and keep the largest fragment
> m <- parse.smiles("c1ccccc1")
> is.connected(m)
[1] TRUE
>
> m <- parse.smiles("C1CC(N=C1)C(=O)[O-].[Na+]")
> is.connected(m)
[1] FALSE
>
> largest <- get.largest.component(m)
> length(get.atoms(largest))
[1] 8
Getting atoms & bonds
- Simple to get atoms and bonds
mol <- parse.smiles("c1ccccc1C(Cl)(Br)c1ccccc1")
atoms <- get.atoms(mol)
bonds <- get.bonds(mol)
- These are lists of jobjRef objects
- rcdk currently supports getting the symbol, coordinates, atomic number, implicit H-count
- Can also check whether an atom is aromatic etc.
- See ?get.symbol
Working with atoms
- Simple elemental analysis
- Identifying flat molecules
mol <- parse.smiles("c1ccccc1C(Cl)(Br)c1ccccc1")
atoms <- get.atoms(mol)
## elemental analysis
syms <- unlist(lapply(atoms, get.symbol))
round( table(syms)/sum(table(syms)) * 100, 2)
## is the molecule flat?
coords <- do.call("rbind", lapply(atoms, get.point3d))
any(apply(coords, 2, function(x) length(unique(x)) == 1))
SMARTS matching
- rcdk supports substructure searches with SMARTS or SMILES
- May not be practical for large collections of molecules due to memory
mols <- sapply(c("CC(C)(C)C",
"c1ccc(Cl)cc1C(=O)O",
"CCC(N)(N)CC"), parse.smiles)
query <- "[#6D2]"
hits <- matches(query, mols)
> print(hits)
CC(C)(C)C c1ccc(Cl)cc1C(=O)O CCC(N)(N)CC
FALSE TRUE TRUE
Persisting objects
- We already saw how we can persist molecule properties
- It’d be useful to also persist CDK objects (i.e., jobjRefs)
m <- parse.smiles("c1ccccc1")
obj <- .jserialize(m)
newObj <- .junserialize(obj)
- Serialized objects are just raw vectors, so they can be saved via save
- But also take a look at .jcache
Visualization
- rcdk supports visualization of 2D structure images in two ways
- First, you can bring up a Swing window
- Second, you can obtain the depiction as a raster image
- Doesn’t work on OS X
mols <- load.molecules("data/dhfr_3d.sd")
## view a single molecule in a Swing window
view.molecule.2d(mols[[1]])
## view a table of molecules
view.molecule.2d(mols[1:10])
Visualization
- The Swing window is a little heavyweight
- It’d be handy to be able to annotate plots with structures
- Or even just make a panel of images that could be saved to a PNG file
- We can make use of rasterImage and rcdk
- As with the Swing window, this won’t work on OS X
Visualization
m <- parse.smiles("c1ccccc1C(=O)NC")
img <- view.image.2d(m, 200,200)
## start a plot
plot(1:10, 1:10, pch=19)
## overlay the structure
rasterImage(img, 1,8, 3,10)
Visualization
- By playing around with plotting parameters we can fill up the plot region with an image
- The values being plotted define the coordinates used to overlay the raster image
mols <- load.molecules("data/dhfr_3d.sd")
## A table of 16 structures
par(mfrow=c(4,4), mar=c(0,0,0,0))
for (i in 1:16) {
img <- view.image.2d(mols[[i]], 200, 200)
plot(1:10, xaxt="n", yaxt="n", xaxs="i", yaxs="i", col="white")
rasterImage(img, 1,1, 10,10)
}
Molecular descriptors
- Numerical representations of chemical structure features
- Can be based on
  - connectivity
  - 3D coordinates
  - electronic properties
  - combination of the above
- Many descriptors are described and implemented in various forms
- The CDK implements 45 descriptor classes, resulting in ≈ 300 individual descriptor values for a given molecule
Descriptor caveats
- Not all descriptors are optimized for speed
- Some of the topological descriptors employ graph isomorphism, which makes them slow on large molecules
- In general, to ensure that we end up with a rectangular descriptor matrix we do not throw exceptions
- Instead descriptor calculation failures return NA
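The idea of returning NA on failure so that the result stays rectangular can be sketched in base R. This is not how rcdk is implemented internally; the two descriptor functions, the helper safe_eval and the toy "molecules" (plain lists) are all hypothetical.

```r
# Hypothetical descriptor functions; 'needs_3d' fails when no coordinates exist.
atom_count <- function(mol) length(mol$symbols)
needs_3d   <- function(mol) {
  if (is.null(mol$coords)) stop("no 3D coordinates")
  max(as.numeric(dist(mol$coords)))   # largest interatomic distance
}

# Evaluate one descriptor, converting any error into NA.
safe_eval <- function(fn, mol) tryCatch(fn(mol), error = function(e) NA_real_)

# Toy molecules as plain lists; 'a' has no coordinates, 'b' does.
mols <- list(
  a = list(symbols = c("C", "C", "O"), coords = NULL),
  b = list(symbols = c("C", "N"), coords = rbind(c(0, 0, 0), c(3, 4, 0)))
)

# One row per molecule, one column per descriptor, NA where evaluation failed.
descs <- data.frame(
  atom_count = sapply(mols, function(m) safe_eval(atom_count, m)),
  needs_3d   = sapply(mols, function(m) safe_eval(needs_3d, m))
)
print(descs)
```

Because every molecule contributes a full row, the result can go straight into modeling functions, with NA handling (for example dropping columns that are all NA) left as an explicit later step.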
CDK Descriptor Classes
- The CDK provides 3 packages for descriptor calculations
  - org.openscience.cdk.qsar.descriptors.molecular
  - org.openscience.cdk.qsar.descriptors.atomic
  - org.openscience.cdk.qsar.descriptors.bond
- rcdk only supports molecular descriptors
- Each descriptor is also described by an ontology
- For rcdk this is used to classify descriptors into groups
Descriptor calculations
- Can evaluate a single descriptor or all available descriptors
- If a descriptor cannot be calculated, NA is returned (so no exceptions are thrown)
- First need to get available descriptor class names
dnames <- get.desc.names()
- Gives you a vector of all descriptor class names
Descriptor calculations
- You can also specify descriptor categories, to get a subset
dnames <- get.desc.names("topological")
## what categories are available?
> get.desc.categories()
[1] "electronic" "protein" "topological" "geometrical" "constitutional" "hybrid"
- Depending on the input structures, some descriptors may not make sense
- If you evaluate 3D descriptors from SMILES input, they’d all be set to NA
Evaluating the descriptors
- With a set of descriptor class names in hand, you can evaluate them
descs <- eval.desc(mols, dnames)
- descs will be a data.frame which can then be used in any of the modeling functions
- Column names are the descriptor names provided by the CDK
- Good practice to put molecule titles in a column or in the row names
The QSAR workflow
- Before model development you’ll need to clean the molecules, evaluate descriptors, generate subsets
- With the numeric data in hand, we can proceed to modeling
- Before building predictive models, we’d probably explore the dataset
  - Normality of the dependent variable
  - Correlations between descriptors and dependent variable
  - Similarity of subsets
- Then we can go wild and build all the models that R supports
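As a rough sketch of the exploration step, the following base R code checks normality of the dependent variable and its correlation with each descriptor. The data are synthetic (random numbers standing in for a descriptor matrix and activities), and the Shapiro-Wilk test is just one possible normality check, chosen here for illustration.

```r
set.seed(1)
# Synthetic stand-in for a descriptor matrix and an activity vector.
descs <- data.frame(d1 = rnorm(50), d2 = rnorm(50), d3 = runif(50))
activity <- 2 * descs$d1 + rnorm(50, sd = 0.5)

# Normality of the dependent variable (Shapiro-Wilk test).
sw <- shapiro.test(activity)
sw$p.value

# Correlation of each descriptor with the dependent variable.
r <- sapply(descs, cor, y = activity)
round(r, 2)
```

With real data, descs would come from eval.desc and activity from the assay values; descriptors with near-zero correlation or with many NA values are natural candidates for removal before modeling.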
Accessing fingerprints
- CDK provides several fingerprints
  - Path-based, MACCS, E-State, PubChem
- Access them via get.fingerprint(...)
- Works on one molecule at a time; use lapply to process a list of molecules
- This method works with the fingerprint package
  - Separate package to represent and manipulate fingerprint data from various sources (CDK, BCI, MOE)
  - Uses C to perform similarity calculations
  - Lots of similarity and dissimilarity metrics available
Accessing fingerprints
mols <- load.molecules("data/dhfr_3d.sd")
## get a single fingerprint
fp <- get.fingerprint(mols[[1]], type="maccs")
## process a list of molecules
fplist <- lapply(mols, get.fingerprint, type="maccs")
- Each fingerprint is an S4 object
- See the fingerprint package man pages for more details
Reading fingerprints
- We can also read in fingerprint data generated by other tools: BCI, MOE, CDK
fplist <- fp.read("data/fp/fp.data", size=1052,
lf=bci.lf, header=TRUE)
- Easy to support new line-oriented fingerprint formats by providing your own line parsing function
- See bci.lf for an example
Similarity metrics
- The fingerprint package implements 28 similarity and dissimilarity metrics
- All accessed via the distance function (so you call this even if you want similarity)
- Implemented in C, but still, large similarity matrix calculations are not a good idea!
## similarity between 2 individual fingerprints
distance(fplist[[1]], fplist[[2]], method="tanimoto")
distance(fplist[[1]], fplist[[2]], method="mt")
## similarity matrix - compare similarity distributions
m1 <- fp.sim.matrix(fplist, method="tanimoto")
m2 <- fp.sim.matrix(fplist, method="carbo")
par(mfrow=c(1,2))
hist(m1, xlim=c(0,1))
hist(m2, xlim=c(0,1))
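For reference, the Tanimoto coefficient used above is the number of bits set in both fingerprints divided by the number of bits set in either. The base R sketch below is independent of the fingerprint package (which does this in C); it represents a fingerprint simply as a vector of "on" bit positions, and the function name tanimoto and the two toy fingerprints are made up for illustration.

```r
# Tanimoto coefficient for two fingerprints given as vectors of "on" bit positions:
# |A intersect B| / |A union B|
tanimoto <- function(bits1, bits2) {
  common <- length(intersect(bits1, bits2))
  total  <- length(union(bits1, bits2))
  if (total == 0) return(NA_real_)   # undefined when both fingerprints are empty
  common / total
}

fp_a <- c(1, 3, 4, 8)        # toy fingerprints, not derived from real molecules
fp_b <- c(1, 4, 8, 9, 12)
tanimoto(fp_a, fp_b)         # 3 shared bits out of 6 distinct bits
```

A similarity of 1 means the two bit sets are identical, 0 means they share no bits; 1 minus the similarity is commonly used as a distance, as in the clustering example later on.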
Comparing datasets with fingerprints
- We can compare datasets based on fingerprints
- Rather than perform pairwise comparisons, we evaluate the normalized occurrence of each bit across the dataset
- Gives us an n-D vector, the “bit spectrum”
bitspec <- bit.spectrum(fplist)
plot(bitspec, type="l")
Guha, R., J. Comp. Aid. Molec. Des., 2008, 22, 367–384
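Taking the description above at face value (the fraction of molecules in which each bit is set), a bit spectrum can be sketched in base R as a column mean over a binary matrix. This is an illustration of the idea rather than the bit.spectrum implementation, and the toy fingerprints and the Euclidean comparison at the end are assumptions made for the example.

```r
# Toy fingerprints as rows of a binary matrix (3 molecules x 5 bits).
fps <- rbind(c(1, 0, 1, 1, 0),
             c(1, 0, 0, 1, 0),
             c(0, 0, 1, 1, 1))

# Bit spectrum: fraction of molecules in which each bit is set.
bit_spectrum <- colMeans(fps)
bit_spectrum

# A second toy dataset, summarized the same way.
other <- colMeans(rbind(c(1, 1, 0, 1, 0),
                        c(1, 1, 0, 1, 1)))

# Two datasets can then be compared with a single vector distance.
sqrt(sum((bit_spectrum - other)^2))
```

Each dataset is reduced to one vector in a single pass over its fingerprints, so comparing two datasets no longer requires all pairwise molecule comparisons.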
Bit spectrum
[Figure: bit spectrum; x-axis: Bit Position (0 to 150), y-axis: Frequency (0.0 to 1.0)]
- Only makes sense with structural key type fingerprints
- Longer fingerprints give better resolution
- Comparing bit spectra, via any distance metric, allows us to compare datasets in O(n) time, rather than O(n²) for a pairwise approach
Guha, R., J. Comp. Aid. Molec. Des., 2008, 22, 367–384
Comparing datasets
[Figure: bit spectra in three panels (Entire Dataset, Subset 1, Subset 2); x-axis: Bit Position (0 to 150), y-axis: Normalized Frequency (0.0 to 1.0)]
Fingerprints & clustering
- Clustering large collections is not practical
- Any way to avoid a pairwise similarity matrix is desirable
## Get fingerprints
mols <- load.molecules("data/dhfr_3d.sd")
fplist <- lapply(mols, get.fingerprint, type="maccs")
## Gives us a similarity matrix
fpdist <- fp.sim.matrix(fplist)
## Convert to a distance matrix
fpdist <- as.dist(1-fpdist)
## Perform the clustering
clus <- hclust(fpdist, method="ward.D")  ## "ward" was renamed "ward.D" in R 3.1.0
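The same similarity-to-distance-to-clustering steps can be tried without rcdk on a small made-up similarity matrix. The values below are invented, and average linkage is used only to keep the sketch independent of how a given R version names the Ward methods.

```r
# Toy symmetric similarity matrix for four items (values are made up).
sim <- matrix(c(1.0, 0.9, 0.2, 0.1,
                0.9, 1.0, 0.3, 0.2,
                0.2, 0.3, 1.0, 0.8,
                0.1, 0.2, 0.8, 1.0),
              nrow = 4, dimnames = list(letters[1:4], letters[1:4]))

# Similarity -> distance, then hierarchical clustering.
d <- as.dist(1 - sim)
clus <- hclust(d, method = "average")

# Cut the tree into two clusters and show the membership of each item.
groups <- cutree(clus, k = 2)
groups
```

Calling plot(clus) draws the dendrogram; cutree then turns the tree into cluster labels that can be attached back to the molecules.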
Fingerprints & clustering
[Figure: dendrogram, Hierarchical Clustering of the Sutherland DHFR Dataset]
Comparing fingerprint performance
- Various studies comparing virtual screening methods
- Generally, the metric of success is how many actives are retrieved in the top n% of the database
- Can be measured using ROC, enrichment factor, etc.
- Exercise: evaluate performance of CDK fingerprints using enrichment factors
  - Load active and decoy molecules
  - Evaluate fingerprints
  - For each active, evaluate similarity to all other molecules (active and inactive)
  - For each active, determine enrichment at a given percentage of the database screened
For a given query molecule, order the dataset by decreasing similarity, look at the top 10% and determine the fraction of actives in that top 10%
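The last step of the exercise can be sketched as a small base R function. It uses the usual definition of the enrichment factor (hit rate in the top slice divided by hit rate in the whole database); the function name enrichment_factor, the similarity values and the active labels are all made up for illustration.

```r
# Enrichment factor at a given fraction of the ranked database:
# (fraction of actives in the top slice) / (fraction of actives overall)
enrichment_factor <- function(similarity, is_active, top_frac = 0.10) {
  ord   <- order(similarity, decreasing = TRUE)
  n_top <- max(1, round(top_frac * length(similarity)))
  top_rate     <- mean(is_active[ord][seq_len(n_top)])
  overall_rate <- mean(is_active)
  top_rate / overall_rate
}

# Made-up similarities of 20 database molecules to one query; 4 are actives.
sim    <- c(0.95, 0.90, 0.40, 0.35, 0.30, seq(0.29, 0.15, length.out = 15))
active <- c(TRUE, TRUE, FALSE, TRUE, rep(FALSE, 15), TRUE)
enrichment_factor(sim, active, top_frac = 0.10)
```

In the full exercise, sim would be one row of the fingerprint similarity matrix (with the query itself removed), and the function would be applied once per active query to get a distribution of enrichment factors.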
Comparing fingerprint performance
[Figure: schematic with labels Query, Targets (Actives, Inactives), Similarity]
Comparing fingerprint performance
- A good dataset to test this out is the Maximum Unbiased Validation datasets by Rohrer & Baumann
- Derived from 17 PubChem bioassay datasets and designed to avoid analog bias and artificial enrichment
- As a result, 2D fingerprints generally show poor performance on these datasets (by design)
- See here for a comparison of various fingerprints using two of these datasets
Rohrer, S.G. et al, J. Chem. Inf. Model, 2009, 49, 169–184
Good, A. & Oprea, T., J. Chem. Inf. Model, 2008, 22, 169–178
Verdonk, M.L. et al, J. Chem. Inf. Model, 2004, 44, 793–806
Comparing fingerprint performance
[Figure: Similarity (0.0 to 1.0) for each Query Molecule (1 to 30)]
Fragment based analysis
- Fragment based analysis can be a useful alternative to clustering, especially for large datasets
- Useful for identifying interesting series
- Many fragmentation schemes are available
  - Exhaustive
  - Rings and ring assemblies
  - Murcko
- The CDK supports fragmentation (still needs work) into Murcko frameworks and ring systems
- We can also use the NCGC fragmenter (see data/fragmenter.jar) to exhaustively generate fragments
Getting fragments
- rcdk currently wraps the GenerateFragments class
- Only exposes the Murcko fragmentation method
- Let's look at working directly with a CDK class and its methods
- Once we have the fragmenter we can perform a Murcko fragmentation and get back the fragments as SMILES
Getting fragments
mol <- parse.smiles(
"c1cc(c(cc1c2c(nc(nc2CC)N)N)[N+](=O)[O-])NCc3ccc(cc3)C(=O)N4CCCCC4")
## get the fragmenter
fragmenter <- .jnew("org/openscience/cdk/tools/GenerateFragments")
## do fragmentation
.jcall(fragmenter, "V", "generateMurckoFragments",
.jcast(mol, "org/openscience/cdk/interfaces/IMolecule"),
TRUE, TRUE, as.integer(6))
## get fragments as SMILES
frags <- .jcall(fragmenter,
"[S",
"getMurckoFrameworksAsSmileArray")
Some caveats
- The fragment SMILES come out as non-aromatic
- There is supposed to be a way to get the appropriate SMILES out; currently looking into this
- Implies that if we want to look at fragment properties, we’re stuck
- However, we can still use these to look at series and/or local trends etc.
Aggregating fragments
- Let's wrap the fragmentation procedure into a function
get.frags <- function(mol, fragmenter) {
  .jcall(fragmenter, "V", "generateMurckoFragments",
         .jcast(mol, "org/openscience/cdk/interfaces/IMolecule"),
         TRUE, TRUE, as.integer(6))
  return(.jcall(fragmenter, "[S", "getMurckoFrameworksAsSmileArray"))
}
Aggregating fragments
- Now we can get fragments for a set of molecules
mols <- load.molecules("data/dfhr_3d.sd")
fragmenter <- .jnew("org/openscience/cdk/tools/GenerateFragments")
frags <- lapply(mols, function(x, f) {
  get.frags(x, f)
}, f=fragmenter)
Aggregating fragments
- This gives us a list of fragment SMILES
> frags[1:3]
[[1]]
[1] "C1=CC=C(C=C1)C=2C=NC=NC=2"

[[2]]
[1] "C1=CC=C(C=C1)C=2C=NC=NC=2"

[[3]]
[1] "C1=CC=C(C=C1)C=2C=NC=NC=2"
- What we want is a data.frame that associates fragments, fragment IDs and the SMILES from which they are derived
Aggregating fragments
frags <- lapply(mols, function(x, f) {
  tmp <- get.frags(x, f)
  tmp <- data.frame(frag=I(tmp),
                    mol=I(get.property(x, "cdk:Title")))
  return(tmp)
}, f=fragmenter)
frags <- do.call("rbind", frags)
- Now we have something useful to work with
> head(frags)
                         frag             mol
1   C1=CC=C(C=C1)C=2C=NC=NC=2 1-pyrimethamine
2   C1=CC=C(C=C1)C=2C=NC=NC=2          1-3062
3   C1=CC=C(C=C1)C=2C=NC=NC=2          1-7364
4   C1=CC=C(C=C1)C=2C=NC=NC=2        1-115194
5 C1=CC=C(C=C1)OCC=2C=CN=CN=2        1-118203
6   C1=CC=C(C=C1)C=2C=NC=NC=2        1-118203
Aggregating fragments
- Finally we generate some arbitrary fragment IDs
ufrags <- data.frame(frag=I(unique(frags$frag)),
                     fid=I(sprintf("F%03d",
                                   1:length(unique(frags$frag)))))
frags <- merge(frags, ufrags, by="frag")
- . . . giving us
> tail(frags)
                                          frag      mol  fid
963         O=C2NC=NC=3NC=C(CNC1=CC=CC=C1)C2=3    43-13 F239
964 O=C2NC4=NC=NC=C4(C=3CN(CC1=CC=CC=C1)CC2=3)     49-4 F265
965 O=S(=O)(C1=CC=CC=C1)C=2C=CC3=NC=NC=C3(N=2)     22-8 F150
966 O=S(=O)(C1=CC=CC=C1)C=2C=CC3=NC=NC=C3(N=2)     22-9 F150
967 O=S(=O)(C1=CC=CC=C1)C=2C=CC3=NC=NC=C3(N=2)     22-7 F150
968        O=S(=O)(NCCOC1=CC=CC=C1)C2=CC=CC=C2 1-122059 F068
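The ID assignment above can be tried without the CDK. What follows is a minimal sketch on a made-up table: the fragment strings and molecule names are invented for illustration, and stringsAsFactors = FALSE is used in place of I() purely for brevity. It also shows that merge() sorts on the key column by default, so the row order after the merge differs from the original order.

```r
## toy fragment table; the values are invented for illustration
frags <- data.frame(frag = c("c1ccccc1", "c1ccncc1", "c1ccccc1", "C1CCCCC1"),
                    mol  = c("m1", "m1", "m2", "m3"),
                    stringsAsFactors = FALSE)

## one row per distinct fragment, each with a zero-padded ID
u <- unique(frags$frag)
ufrags <- data.frame(frag = u,
                     fid  = sprintf("F%03d", seq_along(u)),
                     stringsAsFactors = FALSE)

## attach the IDs; merge() sorts on 'frag' unless sort = FALSE is given
frags.id <- merge(frags, ufrags, by = "frag")
print(frags.id)
```

IDs follow the order of first appearance in the table, so they are arbitrary labels rather than stable identifiers across datasets.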
Doing stuff with fragments
- Look at frequency of occurrence of fragments
- Develop predictive models on fragment members, looking for local SAR
- Pseudo-cluster a dataset based on fragments
- Compound selection based on fragment membership
frag.counts <- by(frags, frags$frag, nrow)
barplot(frag.counts)
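A minimal sketch of the frequency count, and of picking out series by size, on an invented table. It uses table() rather than by(), which is a substitution made here for compactness; the fragment IDs, molecule names and size thresholds are all made up.

```r
## toy table of fragment IDs and molecules; the values are invented
frags <- data.frame(fid = c("F001", "F001", "F001", "F002", "F003", "F003"),
                    mol = c("m1", "m2", "m3", "m1", "m4", "m5"),
                    stringsAsFactors = FALSE)

## number of molecules per fragment, largest series first
frag.counts <- sort(table(frags$fid), decreasing = TRUE)
print(frag.counts)

## keep only series with 2 to 3 members (thresholds are arbitrary)
keep <- names(frag.counts)[frag.counts >= 2 & frag.counts <= 3]

## molecules belonging to each retained series
series <- split(frags$mol, frags$fid)[keep]
print(series)
```

The resulting list can serve as a crude grouping of the dataset, or as the starting point for selecting compounds by fragment membership.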
Characterizing fragment SAR
- Consider some of the larger series and see if there is an SAR within each of them
- Strategy
  - Identify fragments with 20 to 30 members
  - Merge the fragment data with descriptor data for the associated molecules
  - For each fragment series, build an OLS model using exhaustive feature selection
- I've precalculated a set of descriptors for this purpose (d in data/frags.Rda) or you can generate fragments yourself
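The last step of the strategy can be sketched in base R. The slides do not say which selection criterion or package was used, so this version makes its own assumptions: synthetic data, a fixed subset size k, and adjusted R^2 as the ranking criterion.

```r
set.seed(1)

## synthetic series: 25 molecules, 4 descriptors, activity driven by x1 and x3
n <- 25
desc <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
activity <- 2 * desc$x1 - 1.5 * desc$x3 + rnorm(n, sd = 0.1)
dat <- cbind(activity = activity, desc)

## enumerate every subset of exactly k descriptors and fit an OLS model to each
k <- 2
subsets <- combn(names(desc), k, simplify = FALSE)
fits <- lapply(subsets, function(vars) {
  lm(reformulate(vars, response = "activity"), data = dat)
})

## rank the candidate models by adjusted R^2 and keep the best one
adj.r2 <- sapply(fits, function(f) summary(f)$adj.r.squared)
best <- which.max(adj.r2)
best.vars <- subsets[[best]]
print(best.vars)
print(round(adj.r2[best], 3))
```

Exhaustive enumeration is only practical for small descriptor pools, since the number of subsets grows combinatorially. With 20 to 30 molecules per series, the subset size also has to stay small to avoid overfitting.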
Characterizing fragment SAR
[Figure: predicted activity (y-axis) versus observed activity (x-axis), one panel per fragment series: F018, F019, F052, F128, F252]
Predicted versus observed activity for 5 fragment series, based on OLS models and exhaustive feature selection
Part IV
Cheminformatics & R - rpubchem
Public chemical and biological data
Working with PubChem data
[Diagram: workflow for PubChem data]
Inputs: bioassay data (numeric), assay metadata (text), chemical structures
Steps shown: process, extract, standardize, clean, descriptors, merge
Output: model
Working with PubChem data in R
- Access PubChem datasets using idiomatic R
- Minimize extra work (merge data, add metadata etc.)
- End up with a data.frame, ready for modeling
- The usual sequence of tasks would involve
  - Get the assay data
  - Get structure data associated with the assay
  - Proceed with modeling
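The hand-off from the first two tasks to modeling amounts to a join on the compound ID. A minimal sketch with invented values follows. The column names PUBCHEM.CID and CID mirror those that appear later in these slides; the activity values, weights and the type conversion step are assumptions made for illustration.

```r
## toy stand-ins for assay results and structure data; the values are invented
assay <- data.frame(PUBCHEM.CID = c(101, 102, 103),
                    activity    = c(0.2, 1.4, 0.9))
structs <- data.frame(CID             = c("101", "103", "104"),
                      MolecularWeight = c(224, 305, 455),
                      stringsAsFactors = FALSE)

## the structure table carries CIDs as character, so align the types first
structs$CID <- as.numeric(structs$CID)

## inner join: keeps only the compounds present in both tables
model.data <- merge(assay, structs, by.x = "PUBCHEM.CID", by.y = "CID")
print(model.data)
```

Compounds missing from either table are silently dropped by the inner join; passing all.x = TRUE to merge() would keep every assay row instead.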
Getting assay data
- The workhorse method is get.assay, which takes the AID as argument
assay <- get.assay(2018, quiet=FALSE)
- Returns a data.frame derived from the assay data CSV file and the assay description XML file from PubChem
- The description, comments and field descriptions and types are available as attributes of the data.frame
- Note: no structures are included at this point
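For readers who have not used attributes before, here is a minimal base R sketch of how metadata can ride along with a data.frame. The attribute names used here are hypothetical and are not taken from rpubchem; attributes(assay) on a real result shows what is actually attached.

```r
## toy data.frame standing in for assay results; the values are invented
assay <- data.frame(PUBCHEM.CID = c(101, 102), score = c(12, 87))

## attach metadata as attributes (these attribute names are hypothetical)
attr(assay, "description") <- "Example assay description"
attr(assay, "types") <- c(score = "integer")

## read a single attribute back, or list everything attached to the object
desc <- attr(assay, "description")
print(desc)
print(names(attributes(assay)))
```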
Other ways to get assay data
- If you're looking for a group of assays, say related to HDAC inhibitors, perform a keyword search
## gives a vector of integer AIDs
aids <- find.assay.id("HDAC", quiet=FALSE)
## what are the assays?
sapply(aids[1:10], get.assay.desc)
Other ways to get assay data
- Find assays that a compound is active in
cid <- 68368
aids <- get.aid.by.cid(cid, type="active")
- Gives a vector of AIDs that the compound is active in
- If you specify type="raw", a summary is returned
Getting structure information
- Get structure data by CID or SID
structs <- get.cid(assay$PUBCHEM.CID[1:10])
- Gives you a data.frame containing name, structure and some properties
> str(structs)
'data.frame': 10 obs. of 11 variables:
 $ CID : chr "644436" "645300" "645983" "648874" ...
 $ IUPACName : chr "2-(2-methylphenyl)-1,3-benzoxazol-5-amine"
 $ CanonicalSmile : chr "CC1=CC=CC=C1C2=NC3=C(O2)C=CC(=C3)N"
 $ MolecularFormula : chr "C14H12N2O" "C18H15N3O2" "C24H34N6O3"
 $ MolecularWeight : num 224 305 455 407 290 ...
 $ TotalFormalCharge : int 0 0 0 0 0 0 0 0 0 0
 $ XLogP : num 3.1 3.2 3 4.5 2.1 2.9 1.8 2.5 0.7 1.7
 $ HydrogenBondDonorCount : int 1 0 1 1 2 2 1 1 2 2
 $ HydrogenBondAcceptorCount: int 3 4 7 5 4 3 7 8 5 4
 $ HeavyAtomCount : int 17 23 33 29 21 26 33 31 20 21
 $ TPSA : num 52 57 94.4 68 84.5 70.7 94.4 125 104 69.6
Ready to model
- Having gotten the data.frame of structures we can parse the SMILES into CDK objects and evaluate descriptors
- However, get.cid (and get.sid) will not work if you try to retrieve too many in one go
- The current design makes an Entrez URL, which can become too large and so is rejected
- Will be switching to the use of PUG
- In the meantime, retrieve structure data in chunks
Getting CID data in chunks
## will likely fail
structs <- get.cid(assay$PUBCHEM.CID)
## better way, 300 CIDs at a time
library(itertools)
chunks <- as.list(ichunk(assay$PUBCHEM.CID, 300))
structs <- lapply(chunks, get.cid)
structs <- do.call("rbind", structs)
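The same chunking pattern can be written in base R, without itertools. This is a sketch under stated assumptions: chunk.ids and fetch.chunk are hypothetical helper names, and fetch.chunk is a local stand-in for a per-chunk lookup such as get.cid so that the example runs offline.

```r
## split a vector of IDs into consecutive chunks of at most 'size' elements
chunk.ids <- function(ids, size) {
  split(ids, ceiling(seq_along(ids) / size))
}

## hypothetical stand-in for a per-chunk lookup such as get.cid();
## it returns one row per ID so that the results can be row-bound
fetch.chunk <- function(ids) {
  data.frame(CID = ids, n.in.chunk = length(ids))
}

ids <- 1:700
chunks <- chunk.ids(ids, 300)
print(sapply(chunks, length))

## process chunk by chunk, then combine into a single data.frame
res <- do.call("rbind", lapply(chunks, fetch.chunk))
```

Because split() returns plain vectors rather than lists, each chunk can be passed straight to a function that expects a vector of IDs.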
Part V
Extending rcdk
Why extend?
- rcdk doesn't cover the whole CDK API
- I add functions as and when I need them (or if someone asks)
- If you want to access CDK classes and methods, you can use rJava (such as .jnew and .jcall)
- This can get a bit cumbersome
- Use the source code for inspiration
- For more complex stuff, you can write Java code that gets installed along with rcdk
- Ideally this would get contributed back and incorporated into the rcdk.jar file that is bundled with the rcdk package
Structure of the rcdk package
- If you add new functionality, it is a good idea to add unit tests (using RUnit) under rcdk/inst/unitTests/
Structure of the cdkr project
- Within rcdkjar/, run ant jar to build the jar file and copy it into the rcdk package directory
- Generally, no need to touch rcdklibs/
Writing Java in R
- Don't like it, but sometimes it's easier than firing up the Java IDE. Need to be familiar with the CDK API
- Three main functions
  - .jnew
  - .jcall
  - .jcast
Writing Java in R
- First, let's look at a better way to get the atom count of a molecule
- Right now, we get a molecule and get the list of atom objects and return the length of the list
- But IAtomContainer has a method getAtomCount()
mol <- parse.smiles("CCC")
.jcall(mol, "I", "getAtomCount")
- For static methods, the first argument can be the fully qualified class name
- You do need to specify the return type in JNI notation
Writing Java in R
JNI Signature              Java Type
I                          int
Z                          boolean
B                          byte
C                          char
Lfully-qualified-class;    fully-qualified-class
[type                      type[]
- For the L signature, remember the semi-colon at the end
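Since the L and [ forms are easy to mistype, the signature strings can be built programmatically. A minimal sketch follows; jni.class and jni.array are hypothetical helper names and are not part of rJava or rcdk.

```r
## build the JNI signature for an object type from a class name;
## accepts either dots or slashes as the package separator
jni.class <- function(cls) {
  paste0("L", gsub(".", "/", cls, fixed = TRUE), ";")
}

## build the JNI signature for a one-dimensional array of a given element type
jni.array <- function(sig) {
  paste0("[", sig)
}

mol.sig <- jni.class("org.openscience.cdk.interfaces.IMolecule")
print(mol.sig)
print(jni.array("I"))
print(jni.array(jni.class("java.lang.String")))
```

The "[S" seen in the earlier fragmenter calls is rJava's shorthand for a string array, so the fully spelled-out string signature is rarely needed in practice.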
Writing Java in R
- Another example is getting the largest connected component
- We make use of the ConnectivityChecker class and the partitionIntoMolecules method
mol <- parse.smiles("CCC(=O)C[O-].[Na+]")
molSet <- .jcall("org.openscience.cdk.graph.ConnectivityChecker",
                 "Lorg/openscience/cdk/interfaces/IMoleculeSet;",
                 "partitionIntoMolecules", mol)
ncomp <- .jcall(molSet, "I", "getMoleculeCount")
Writing Java in Java
- When writing Java in R gets too tedious, switch to Java in Java
- You could write classes to do the work and then package them into a jar
- Copy the jar into the inst/cont of the rcdk directory
- Edit .First.lib in R/rcdk.R to add this jar to the classpath
- Alternatively edit the Java sources for the rcdk package
  - Need to access the GitHub repository
Future developments
- Update rpubchem to use PUG throughout
- Data table and depictions
- Streamline I/O and molecule configuration
- Add more atom and bond level operations
- Convert from jobjRef to S4 objects and vice versa
  - Would allow for serialization of CDK data classes
  - Is it worth the effort?
- Your suggestions