Search Pubmed with R Part1 and Part2


R Project

R is a free software environment for statistical computing, data manipulation, calculation and graphical display (1,2).

For those interested, the associated Bioconductor project provides many additional R packages for statistical data analysis in different life science areas, such as tools for microarray, next generation sequence and genome analysis. The R software is free and runs on all common operating systems (2-4).

Facilitates the inclusion of biological metadata from literature data such as PubMed.

Provides access to powerful statistical and graphical methods.

References:
1- The R Project for Statistical Computing: http://www.r-project.org/
2- W. N. Venables, D. M. Smith and the R Development Core Team. An Introduction to R. Notes on R: A Programming Environment for Data Analysis and Graphics. Version 2.14.2 (2012-02-29).
3- R & Bioconductor Manual. Author: Thomas Girke, UC Riverside. http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual#TOC-R-Basics
4- Bioconductor: http://www.bioconductor.org/

Install R

1- Install the latest release of R according to the instructions provided by The R Project for Statistical Computing: http://www.r-project.org/
2- Once installed, open the R command window (R console).
3- In the R console, the > prompt (shown in red) is where you type commands.
4- Any text or comment in R beginning with the hash symbol # is ignored.

References
1- The R Project for Statistical Computing: http://www.r-project.org/
2- Bioconductor: http://www.bioconductor.org/
3- R Tutorials. W.B. King. 2010. http://ww2.coastal.edu/kingw/statistics/R-tutorials/preliminaries.html

Install packages in R

1- In the R console, type the following to connect to Bioconductor:
source("http://bioconductor.org/biocLite.R")
2- To request installation of the packages, type:
biocLite()
3- Install the packages "RISmed" and "tm" by typing (see next slide):
biocLite(c("RISmed", "tm"))
4- Install the package "ggplot2" by typing:
biocLite("ggplot2")

Note: in current Bioconductor releases, biocLite() has been replaced by the BiocManager package. The references below give CRAN URLs for all three packages, so install.packages(c("RISmed", "tm", "ggplot2")) is an alternative.

Package ‘RISmed’ downloads content from NCBI databases.
Package ‘tm’ provides text mining functionality.
Package ‘ggplot2’ is for data visualization.

References
1- Bioconductor: http://www.bioconductor.org/
RISmed package: Stephanie Kovalchik (2013). RISmed: Download content from NCBI databases. R package version 2.1.0. http://CRAN.R-project.org/package=RISmed
tm package: Ingo Feinerer and Kurt Hornik (2013). tm: Text Mining Package. R package version 0.5-8.3. http://CRAN.R-project.org/package=tm
ggplot2 package: H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009. http://had.co.nz/ggplot2/book also http://cran.r-project.org/web/packages/ggplot2/index.html
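Before moving on, it can help to confirm which of the three packages are already available. The following is a minimal sketch using only base R; the variable names needed and not.installed are illustrative and not part of the original slides.

```r
# Packages used in this tutorial
needed <- c("RISmed", "tm", "ggplot2")

# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise
available <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still have to be installed
not.installed <- needed[!available]

if (length(not.installed) > 0) {
  message("Not yet installed: ", paste(not.installed, collapse = ", "))
} else {
  message("All packages are installed.")
}
```

Any package reported as missing can then be installed with one of the commands above.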

The R Console

Query PubMed titles for oncolytic virus using RISmed

Type the following in the R console:
library(RISmed)
onc <- EUtilsSummary("oncolytic virus[Majr]")
onc
# [1] "\"oncolytic viruses\"[MeSH Major Topic]"
fetch.onc <- EUtilsGet(onc)
fetch.onc
# PubMed query: "oncolytic viruses"[MeSH Major Topic] Records: 713
onc.tit <- ArticleTitle(fetch.onc)
onc.tit <- unlist(onc.tit)
# export title results as text file
write(onc.tit, file="title_oncolytic_virus.txt")

Query PubMed MeSH topics for oncolytic virus using RISmed

# Continue to type the following in the R console:
mh <- Mesh(fetch.onc)
# for each article, join all MeSH headings into one string separated by ";"
mh.per.row <- lapply(1:length(mh), function(i){
  mh.df.rbind <- as.data.frame(do.call(rbind, mh[i]))
  paste(mh.df.rbind$Heading, collapse= ";")
})
mh.list <- unlist(mh.per.row)
# The following exports the mesh results as a text file
write(mh.list, file="mesh_oncolytic_virus.txt")
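The collapsing step above can be tried without a network connection. This is a minimal sketch on a hand-made list, assuming (as the code above implies) that each article yields a data frame with a Heading column and that an article without MeSH terms yields NA. The object toy.mesh is invented for illustration.

```r
# Toy stand-in for the list returned by Mesh(): one element per article
toy.mesh <- list(
  data.frame(Heading = c("Oncolytic Viruses", "Humans")),
  NA,                                  # article without MeSH terms (assumed)
  data.frame(Heading = "Neoplasms")
)

# Join the headings of each article into a single ";"-separated string
toy.per.row <- vapply(toy.mesh, function(m) {
  if (is.data.frame(m)) paste(m$Heading, collapse = ";") else ""
}, character(1))

toy.per.row
```

Each article ends up as one line of text, which is why the exported file has one row per article.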

View results in Excel

# export both title and mesh results as a text file to view as a table in Excel
tit.mh <- cbind(onc.tit, mh.list)
tit.mh[1:10,] # view first 10 results
write.table(tit.mh, file="tit_mesh_oncolytic_virus.txt", row.names=FALSE, sep="\t")
# !!open file in Excel

[Screenshot: the first column contains the titles, the second column the corresponding MeSH terms]
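To check that a tab-separated export of this kind reads back as a two-column table, here is a minimal sketch with invented data, written to a temporary file rather than the working directory. The names toy.titles, toy.mesh.terms and out.file are illustrative.

```r
# Invented example data: two titles with their joined MeSH terms
toy.titles     <- c("Title A", "Title B")
toy.mesh.terms <- c("Humans;Neoplasms", "Oncolytic Viruses")

# Combine the two vectors column-wise, as done with cbind() above
toy.tab <- cbind(title = toy.titles, mesh = toy.mesh.terms)

# Write a tab-separated file to a temporary location
out.file <- tempfile(fileext = ".txt")
write.table(toy.tab, file = out.file, row.names = FALSE, sep = "\t")

# Read it back: one row per article, one column per field
back <- read.delim(out.file, stringsAsFactors = FALSE)
back
```

A spreadsheet program splits on the same tab character, so each article appears on its own row.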

Preparing for Text Mining Analysis

Type getwd() in the R console to display the R working directory. In my case: [1] "C:/Documents and Settings/PMarqui/My Documents"

Now create a new folder in the R working directory and give it a name (for example, OncolyticVirus).

Place two of the recently created text files in the new folder: title_oncolytic_virus.txt and mesh_oncolytic_virus.txt
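The folder can also be prepared from the R console instead of the file manager. The sketch below runs in a temporary directory with placeholder files so that it does not touch real data. In practice the working directory would take the place of work.dir, and the two files would already exist.

```r
# Temporary stand-in for the R working directory
work.dir <- tempfile("demo_wd_")
dir.create(work.dir)

# Placeholder versions of the two exported files (contents are invented)
files <- c("mesh_oncolytic_virus.txt", "title_oncolytic_virus.txt")
writeLines("Oncolytic Viruses;Humans", file.path(work.dir, files[1]))
writeLines("An oncolytic virus study", file.path(work.dir, files[2]))

# Create the corpus folder and copy both files into it
corpus.dir <- file.path(work.dir, "OncolyticVirus")
dir.create(corpus.dir)
copied <- file.copy(file.path(work.dir, files), corpus.dir)

list.files(corpus.dir)  # the folder should now contain exactly the two files
```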

Start the Text Mining Analysis

Text Mining Analysis

# Type the following in the R console
library(tm) # loads the text mining package
my.corpus <- Corpus(DirSource("OncolyticVirus"), readerControl=list(reader=readPlain))
# Note that "OncolyticVirus" refers to the name of the newly created folder. In Corpus(DirSource("...")) you must use the name given to the folder containing the 2 text files.
my.corpus <- tm_map(my.corpus, stripWhitespace) # removes extra whitespace
my.corpus <- tm_map(my.corpus, gsub, pattern="[^-[:alnum:][:space:]]", replacement=" ") # removes punctuation except the dash "-"
# my.corpus <- tm_map(my.corpus, removeNumbers) # removes numbers (optional)
# Note: in tm 0.6 and later, plain functions such as gsub and tolower must be wrapped in content_transformer(), for example tm_map(my.corpus, content_transformer(tolower)).
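To see what the punctuation pattern does, the same regular expression can be applied to a single string in base R. This is a minimal sketch; the example sentence is invented, and the whitespace and lower-case steps are base R stand-ins for the tm transformations, not the tm functions themselves.

```r
x <- "Oncolytic  virus (HSV-1): phase I/II trial."

# Replace every character that is not a dash, letter, digit or space with a space
no.punct <- gsub("[^-[:alnum:][:space:]]", " ", x)

# Collapse runs of whitespace, trim the ends and convert to lower case
clean <- tolower(trimws(gsub("[[:space:]]+", " ", no.punct)))

clean
```

The dash in "HSV-1" survives, while brackets, the colon, the slash and the full stop are removed.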

Text Mining Analysis

# Continue and type the following code in the R console:
my.corpus <- tm_map(my.corpus, tolower) # conversion to lower case letters
my.corpus <- tm_map(my.corpus, removeWords, stopwords("english")) # removes stopwords
my.corpus <- tm_map(my.corpus, stemDocument) # removes suffixes from words to get the common stem
my.corpus.matrix <- TermDocumentMatrix(my.corpus) # creates a term-document matrix
mat.my.corpus <- as.matrix(my.corpus.matrix) # creates a matrix
my.corpus.df <- as.data.frame(mat.my.corpus) # creates a data frame from the matrix displaying all the terms in either of the 2 documents
my.corpus.df[200:250,1:2] # view some of the terms
copy.my.corpus.df <- my.corpus.df # make a copy of my.corpus.df to keep the original data frame for later
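A term-document matrix has one row per term and one column per document, with each cell holding the number of times the term occurs. The following base R sketch builds such a matrix by hand for two invented one-line documents. It illustrates the structure only and is not how tm computes it internally.

```r
# Two invented documents, named like the two columns used in this tutorial
docs <- c(mesh = "virus humans virus neoplasms", title = "virus therapy")

# Split each document into lower-case tokens
tokens <- strsplit(tolower(docs), "[[:space:]]+")
names(tokens) <- names(docs)

# All distinct terms, in alphabetical order
terms <- sort(unique(unlist(tokens)))

# Count each term in each document: rows are terms, columns are documents
toy.tdm <- vapply(tokens, function(tk) {
  as.integer(table(factor(tk, levels = terms)))
}, integer(length(terms)))
rownames(toy.tdm) <- terms

toy.tdm
```

A term that is absent from a document gets a count of 0, which is what the later subset() calls rely on.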

Text Mining Analysis

# Continue and type the following code in the R console:
# sort the data frame by mesh term frequency, most frequent first
my.corpus.df <- my.corpus.df[order(my.corpus.df$mesh_oncolytic_virus.txt, decreasing = TRUE),]
# assign the 50 most frequent mesh terms to xx
xx <- my.corpus.df[1:50,]
# view the top 5 most frequent mesh terms; head(xx, 5) is equivalent
xx[1:5,]
# sort the 50 most frequent mesh terms in increasing order (for plot visualization)
xx <- xx[order(xx$mesh_oncolytic_virus.txt, decreasing = FALSE),]
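The sort, select and re-sort pattern above can be checked on a small invented data frame. This is a minimal sketch; toy.df and its counts are made up, and the top 2 terms stand in for the top 50.

```r
# Invented term counts for four stemmed terms
toy.df <- data.frame(
  mesh  = c(3, 10, 0, 7),
  title = c(5, 1, 2, 8),
  row.names = c("therapi", "virus", "clinic", "tumor")
)

# Sort by mesh count, most frequent first, and keep the top 2
toy.sorted <- toy.df[order(toy.df$mesh, decreasing = TRUE), ]
toy.top <- toy.sorted[1:2, ]

# Re-sort the selection in increasing order for plotting
toy.top.plot <- toy.top[order(toy.top$mesh, decreasing = FALSE), ]

toy.top.plot
```

The increasing order matters only for the plot: after coord_flip() the most frequent term ends up at the top of the axis.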

Text Mining Analysis

# Continue and type the following code in the R console:
# Plot the 50 most frequent mesh terms using library ggplot2
library(ggplot2)
Terms <- rownames(xx)
Mesh.count <- xx$mesh_oncolytic_virus.txt
ggplot(xx) + geom_point(aes(Terms, Mesh.count), stat = "identity", fill = "darkblue") + coord_flip() + theme_bw()
# keep the terms in sorted order on the axis instead of alphabetical order
p1 <- last_plot() + scale_x_discrete(limits=(Terms))
p1

Text Mining Analysis

[Plot: the 50 most frequent MeSH terms]

Part 2

Text Mining Analysis

# Continue and type the following code in the R console:
# now select the most frequent title terms; therefore sort by title count in decreasing order
my.corpus.df <- my.corpus.df[order(my.corpus.df$title_oncolytic_virus.txt, decreasing = TRUE),]
xy <- my.corpus.df[1:50,] # assign the 50 most frequent title terms to xy
xy[1:5,] # view the top 5 most frequent title terms
# sort the 50 most frequent title terms in increasing order (for plot visualization)
xy <- xy[order(xy$title_oncolytic_virus.txt, decreasing = FALSE),]
# Plot the 50 most frequent title terms
require(ggplot2)
Terms <- rownames(xy) # terms must come from xy (title terms), not xx
Title.count <- xy$title_oncolytic_virus.txt
ggplot(xy) + geom_point(aes(Terms, Title.count), stat = "identity", fill = "darkblue") + coord_flip() + theme_bw()
p2 <- last_plot() + scale_x_discrete(limits=(Terms))
p2

Text Mining Analysis

[Plot: the 50 most frequent title terms]

Text Mining Analysis

# Continue and type the following code in the R console:
# Create separate data frames for each frequency type
my.corpus.sub1.df <- subset(my.corpus.df, mesh_oncolytic_virus.txt>0 & title_oncolytic_virus.txt>0) # terms common to the 2 documents
my.corpus.sub1.df[200:300,1:2] # view some of the subset terms
my.corpus.sub2.df <- subset(my.corpus.df, mesh_oncolytic_virus.txt==0 & title_oncolytic_virus.txt>0) # terms present in title but not in mesh
my.corpus.sub2.df[200:300,1:2] # view some of the terms (200-300)
my.corpus.sub3.df <- subset(my.corpus.df, mesh_oncolytic_virus.txt>0 & title_oncolytic_virus.txt==0) # terms present in mesh but not in title
my.corpus.sub3.df[200:300,1:2] # view some of the terms

# Correlate term counts in title and mesh
cor(my.corpus.df$title_oncolytic_virus.txt, my.corpus.df$mesh_oncolytic_virus.txt)
# correlation coefficient is [1] 0.4442518
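The three subsets and the correlation can be illustrated on a small invented table of counts. This is a minimal sketch; toy.counts and its numbers are made up and only show which rows each condition selects.

```r
# Invented counts: one row per stemmed term
toy.counts <- data.frame(
  mesh  = c(3, 0, 4, 1),
  title = c(2, 5, 0, 1),
  row.names = c("virus", "therapi", "human", "tumor")
)

shared     <- subset(toy.counts, mesh > 0 & title > 0)   # in both documents
title.only <- subset(toy.counts, mesh == 0 & title > 0)  # in title only
mesh.only  <- subset(toy.counts, mesh > 0 & title == 0)  # in mesh only

# Pearson correlation between the two count columns (always between -1 and 1)
r <- cor(toy.counts$title, toy.counts$mesh)
r
```

Every term falls into exactly one of the three subsets, since a term with zero counts in both documents would not be in the matrix at all.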

Text Mining Analysis

# below generates a term frequency vector from the term-document matrix
termFrequency <- rowSums(as.matrix(my.corpus.matrix))
my.tdm <- TermDocumentMatrix(my.corpus, control = list(minWordLength = 1))
# Note: in recent tm releases the equivalent control option is wordLengths = c(1, Inf)
my.tdm
# A term-document matrix (2632 terms, 2 documents)
# below selects those terms from the term-document matrix which occur at least 100 times
findFreqTerms(my.tdm[,1], lowfreq=100)
findFreqTerms(my.tdm[,2], lowfreq=100)
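The idea behind the frequency vector and the frequency threshold can be reproduced on a plain matrix. This is a minimal base R sketch with invented counts. It mirrors what the calls above select but does not use tm, and freq.matrix is a made-up stand-in for the real term-document matrix.

```r
# Invented term-document matrix: rows are terms, columns are documents
freq.matrix <- matrix(
  c(120, 3, 150,    # mesh column
    40, 101, 99),   # title column
  ncol = 2,
  dimnames = list(c("virus", "cell", "tumor"), c("mesh", "title"))
)

# Total frequency of each term over both documents
total.freq <- rowSums(freq.matrix)

# Terms occurring at least 100 times in each document
frequent.mesh  <- rownames(freq.matrix)[freq.matrix[, "mesh"]  >= 100]
frequent.title <- rownames(freq.matrix)[freq.matrix[, "title"] >= 100]

total.freq
frequent.mesh
frequent.title
```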

For part 2

Text Mining Analysis

# Code for plot 3: most frequent title terms with the corresponding mesh terms
my.corpus.df <- my.corpus.df[order(my.corpus.df$title_oncolytic_virus.txt, decreasing = TRUE),]
xy <- my.corpus.df[1:50,] # assign the 50 most frequent title terms to xy
# sort the 50 most frequent title terms in increasing order (for plot visualization)
xy <- xy[order(xy$title_oncolytic_virus.txt, decreasing = FALSE),]
# Plot the 50 most frequent title terms and the mesh counts of those same terms
Terms <- rownames(xy) # terms must come from xy (title terms), not xx
Title.count <- xy$title_oncolytic_virus.txt
Mesh.count <- xy$mesh_oncolytic_virus.txt
ggplot(xy, aes(Terms)) + geom_point(aes(y = Mesh.count, colour = "Mesh.count")) + geom_point(aes(y = Title.count, colour = "Title.count"))
p3 <- last_plot() + coord_flip()
p3 <- last_plot() + scale_x_discrete(limits=(Terms))
p3

Text Mining Analysis

[Plot 3: most frequent title terms with the corresponding mesh terms]

Text Mining Analysis

# Code for plot 4: most frequent title terms and most frequent mesh terms
top50.mh.ti <- rbind(xx,xy) # combine top 50 mesh and title terms
Terms <- rownames(top50.mh.ti) # assign rownames to Terms
msh <- top50.mh.ti$mesh_oncolytic_virus.txt
titl <- top50.mh.ti$title_oncolytic_virus.txt
p4 <- ggplot(top50.mh.ti)
p4 <- p4 + geom_text(aes(x = msh, y = titl, label = Terms))
p4

Text Mining Analysis

[Plot 4: most frequent title terms and most frequent mesh terms]

Text Mining Analysis

# plot 5: most frequent title terms and most frequent mesh terms
my.corpus.df <- my.corpus.df[order(my.corpus.df$title_oncolytic_virus.txt, decreasing = TRUE),]
xy <- my.corpus.df[1:50,] # assign the 50 most frequent title terms to xy
xy[1:5,] # view the top 5 most frequent title terms
# sort the 50 most frequent title terms in increasing order (for plot visualization)
xy <- xy[order(xy$title_oncolytic_virus.txt, decreasing = FALSE),]
top50.mh.ti <- rbind(xx,xy) # combine top 50 mesh and title terms
top50.mh.ti$Term <- rownames(top50.mh.ti) # store the terms in their own column
rownames(top50.mh.ti) <- NULL # drop the row names now that Term holds them
colnames(top50.mh.ti)[1] <- "msh" # change col name
colnames(top50.mh.ti)[2] <- "title" # change col name

Text Mining Analysis

# plot 5: most frequent title terms and most frequent mesh terms
library("reshape2")
# library("reshape2") is used to transform wide format data by means of the melt function. The melt function takes data in wide format and stacks a set of columns into a single column of data.
top50.melt <- melt(top50.mh.ti, measure.vars = c("title", "msh"))
top50.melt
p <- ggplot(top50.melt, aes(Term, value, colour = variable)) + geom_point() + coord_flip()
# each term occurs once per variable after melting, so pass each term only once as an axis limit
p5 <- last_plot() + scale_x_discrete(limits=unique(top50.melt$Term))
p5

Reference for reshape package: Hadley Wickham (2007). Reshaping Data with the reshape Package. Journal of Statistical Software, 21(12), 1-20. URL http://www.jstatsoft.org/v21/i12/.
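The wide-to-long stacking that melt performs can be sketched in base R on an invented two-term table. This illustrates the resulting shape only and is not the reshape2 implementation; wide and long are made-up names.

```r
# Invented wide table: one row per term, one column per count type
wide <- data.frame(
  Term  = c("virus", "tumor"),
  msh   = c(10, 7),
  title = c(8, 5),
  stringsAsFactors = FALSE
)

# Stack the two count columns into one "value" column,
# with "variable" recording which column each value came from
long <- data.frame(
  Term     = rep(wide$Term, times = 2),
  variable = rep(c("title", "msh"), each = nrow(wide)),
  value    = c(wide$title, wide$msh),
  stringsAsFactors = FALSE
)

long
```

The long table has one row per term and count type, which is the layout ggplot needs in order to colour the points by variable.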

Text Mining Analysis

# plot 5: most frequent title terms and most frequent mesh terms

p5 <- ggplot(top50.melt, aes(Term, value, colour = variable)) + geom_point() + coord_flip()

p5

Text Mining Analysis

[Plot 5: most frequent title terms and most frequent mesh terms]