cpmarqui
R Project
R is a free software environment for statistical computing, data manipulation, calculation and graphical display (1,2).
For those interested, the associated Bioconductor project provides many additional R packages for statistical data analysis in different life science areas, such as tools for microarray, next generation sequence and genome analysis. The R software is free and runs on all common operating systems (2-4).
Facilitates the inclusion of biological metadata from literature data such as PubMed.
Provides access to powerful statistical and graphical methods.
References:
1- The R Project for Statistical Computing: http://www.r-project.org/
2- W. N. Venables, D. M. Smith and the R Development Core Team. An Introduction to R. Notes on R: A Programming Environment for Data Analysis and Graphics. Version 2.14.2 (2012-02-29).
3- R & Bioconductor Manual. Author: Thomas Girke, UC Riverside. http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual#TOC-R-Basics
4- Bioconductor: http://www.bioconductor.org/
Install R
1- Install the latest release of R according to the instructions provided by The R Project for Statistical Computing: http://www.r-project.org/
2- Once installed, open the R command window (R console).
3- In the R console, the > prompt (shown in red) is where you type the commands.
4- Any text beginning with the hash symbol # is treated as a comment and ignored by R.
References
1- The R Project for Statistical Computing: http://www.r-project.org/
2- Bioconductor: http://www.bioconductor.org/
3- R Tutorials. W.B. King. 2010. http://ww2.coastal.edu/kingw/statistics/R-tutorials/preliminaries.html
Install packages in R
1- In the R console, type the following to connect to Bioconductor and load its installer:
source("http://bioconductor.org/biocLite.R")
2- To run the Bioconductor installer, type:
biocLite()
3- Install the packages "RISmed" and "tm" by typing:
biocLite(c("RISmed", "tm"))
4- Install the package "ggplot2" by typing:
biocLite("ggplot2")
Package 'RISmed' downloads content from NCBI databases.
Package 'tm' provides text mining functionality.
Package 'ggplot2' is for data visualization.
References
1- Bioconductor: http://www.bioconductor.org/
RISmed package: Stephanie Kovalchik (2013). RISmed: Download content from NCBI databases. R package version 2.1.0. http://CRAN.R-project.org/package=RISmed
tm package: Ingo Feinerer and Kurt Hornik (2013). tm: Text Mining Package. R package version 0.5-8.3. http://CRAN.R-project.org/package=tm
ggplot2 package: H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009. http://had.co.nz/ggplot2/book also http://cran.r-project.org/web/packages/ggplot2/index.html
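The three packages above are all cited with CRAN URLs, so they can probably also be installed with base R's install.packages(). This may be the easier route on current R versions, where Bioconductor has replaced the biocLite.R script with the BiocManager package. The sketch below is only an illustration: missing_packages is a hypothetical helper name, not part of any package, and the install step is guarded so it only runs in an interactive session.

```r
# Packages used in this tutorial; all three are cited above with CRAN URLs.
pkgs <- c("RISmed", "tm", "ggplot2")

# Hypothetical helper: return the packages that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  pkgs[!pkgs %in% installed]
}

to_install <- missing_packages(pkgs)

# Install only in an interactive session, so a script never downloads on its own.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Running the block a second time after a successful install should leave to_install empty.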
Query PubMed titles for oncolytic virus using RISmed
Type the following in the R console:
library(RISmed)
onc <- EUtilsSummary("oncolytic virus[Majr]")
onc
# [1] "\"oncolytic viruses\"[MeSH Major Topic]"
fetch.onc <- EUtilsGet(onc)
fetch.onc
# PubMed query: "oncolytic viruses"[MeSH Major Topic] Records: 713
onc.tit <- ArticleTitle(fetch.onc)
onc.tit <- unlist(onc.tit)
# export title results as text file
write(onc.tit, file="title_oncolytic_virus.txt")
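To see what the exported title file looks like without querying PubMed, the export step can be tried on a toy vector. This is a minimal sketch using base R only; toy.titles is made-up data standing in for the result of ArticleTitle(fetch.onc), and the file goes to a temporary folder rather than the working directory.

```r
# Hypothetical titles standing in for the result of ArticleTitle(fetch.onc).
toy.titles <- c("Oncolytic virus therapy for glioma.",
                "Herpes simplex virus vectors in cancer treatment.")

# Write to a temporary file; write() puts each character element on its own line.
out.file <- file.path(tempdir(), "title_toy.txt")
write(toy.titles, file = out.file)

# Read the file back to confirm there is one title per line.
read.back <- readLines(out.file)
length(read.back) == length(toy.titles)
```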
Query PubMed MeSH topics for oncolytic virus using RISmed
# Continue to type the following in the R console:
mh <- Mesh(fetch.onc)
# collapse the MeSH headings of each record into one string separated by ";"
mh.per.row <- lapply(seq_along(mh), function(i) {
  mh.df.rbind <- as.data.frame(do.call(rbind, mh[i]))
  paste(mh.df.rbind$Heading, collapse = ";")
})
mh.list <- unlist(mh.per.row)
# the following exports the MeSH results as a text file
write(mh.list, file="mesh_oncolytic_virus.txt")
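The collapsing step above can be tried offline on toy data. The sketch below assumes that Mesh() returns one list element per article, either a data frame with a Heading column or NA when an article has no MeSH terms; toy.mesh and collapse_headings are hypothetical names used only for illustration.

```r
# Hypothetical stand-in for Mesh(fetch.onc): one element per article,
# either a data frame with a Heading column or NA when no MeSH terms exist.
toy.mesh <- list(
  data.frame(Heading = c("Oncolytic Virotherapy", "Neoplasms"),
             Type = c("Descriptor", "Descriptor"),
             stringsAsFactors = FALSE),
  NA,
  data.frame(Heading = "Oncolytic Viruses", Type = "Descriptor",
             stringsAsFactors = FALSE)
)

# Join the headings of one record with ";", or return "" when there are none.
collapse_headings <- function(x) {
  if (!is.data.frame(x)) return("")
  paste(x$Heading, collapse = ";")
}

# vapply guarantees exactly one string per record, so rows stay aligned with titles.
toy.list <- vapply(toy.mesh, collapse_headings, character(1))
toy.list
```

Keeping one string per record, including empty ones, matters because the titles and MeSH strings are later bound together column by column.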
View results in Excel
# export both title and MeSH results as one tab-delimited text file to view as a table in Excel
tit.mh <- cbind(onc.tit, mh.list)
tit.mh[1:10,] # view first 10 results
write.table(tit.mh, file="tit_mesh_oncolytic_virus.txt", row.names=F, sep="\t")
# !! open the file in Excel
The first column contains the titles and the second column contains the corresponding MeSH terms.
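Before opening the file in Excel, the export can be checked from R by reading it back. This is a small sketch on made-up data (toy.tit and toy.mh are hypothetical), written to a temporary folder; read.delim() is base R's reader for tab-delimited files.

```r
# Toy stand-ins for onc.tit and mh.list.
toy.tit <- c("Title one", "Title two")
toy.mh  <- c("Neoplasms;Oncolytic Viruses", "Oncolytic Virotherapy")
toy.tab <- cbind(toy.tit, toy.mh)

# Export as tab-delimited text, as done above.
out.file <- file.path(tempdir(), "tit_mesh_toy.txt")
write.table(toy.tab, file = out.file, row.names = FALSE, sep = "\t")

# Read the file back and inspect its shape and column names.
check <- read.delim(out.file, stringsAsFactors = FALSE)
dim(check)
names(check)
```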
Preparing for Text Mining Analysis
Type getwd() in the R console to display the R working directory. In my case:
[1] "C:/Documents and Settings/PMarqui/My Documents"
Now create a new folder in the R working directory and give it a name (for example, OncolyticVirus).
Place two of the recently created text files in the new folder: title_oncolytic_virus.txt and mesh_oncolytic_virus.txt
Start the Text Mining Analysis
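The folder preparation described above can also be done from within R instead of by hand. The following is only a sketch: it works inside a temporary directory (work.dir is a stand-in for the real working directory) and uses toy file contents, so it does not touch any real files.

```r
# Stand-in for the working directory so the sketch does not touch real files.
work.dir <- file.path(tempdir(), "wd_sketch")
dir.create(work.dir, showWarnings = FALSE)

# Toy versions of the two exported files.
files <- c("title_oncolytic_virus.txt", "mesh_oncolytic_virus.txt")
for (f in files) writeLines("toy content", file.path(work.dir, f))

# Create the corpus folder and copy both files into it.
corpus.dir <- file.path(work.dir, "OncolyticVirus")
dir.create(corpus.dir, showWarnings = FALSE)
copied <- file.copy(file.path(work.dir, files), corpus.dir, overwrite = TRUE)

list.files(corpus.dir)
```

In a real session, work.dir would be replaced by getwd() and the writeLines() step would be dropped, since the two files already exist.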
Text Mining Analysis
# Type the following in the R console
library(tm) # loads the text mining package
my.corpus <- Corpus(DirSource("OncolyticVirus"), readerControl=list(reader=readPlain))
# Note that "OncolyticVirus" refers to the name of the newly created folder. In DirSource("...") you must use the name given to the folder containing the 2 text files.
my.corpus <- tm_map(my.corpus, stripWhitespace) # removes extra whitespace
my.corpus <- tm_map(my.corpus, gsub, pattern="[^-[:alnum:][:space:]]", replacement=" ") # removes punctuation except the dash "-" (tm 0.6 and later expect content_transformer(gsub) here)
# my.corpus <- tm_map(my.corpus, removeNumbers) # removes numbers (optional)
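The effect of the punctuation pattern above is easier to see on a single string. The sketch below applies the same regular expression with base R's gsub(); sample.text is a made-up example, and the whitespace collapsing at the end is an extra step added here for readability, not part of the original pipeline.

```r
sample.text <- "Oncolytic virus: HSV-1 (phase I/II) trial!"

# Same pattern as above: replace anything that is not a dash, letter,
# digit or whitespace with a space.
cleaned <- gsub("[^-[:alnum:][:space:]]", " ", sample.text)

# Collapse the runs of spaces left behind and trim the ends (base R only).
cleaned <- trimws(gsub("[[:space:]]+", " ", cleaned))
cleaned
# [1] "Oncolytic virus HSV-1 phase I II trial"
```

The dash in "HSV-1" survives because the leading "-" inside the bracket expression is treated as a literal character.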
Text Mining Analysis
# Continue and type the following code in the R console:
my.corpus <- tm_map(my.corpus, tolower) # conversion to lower case letters (tm 0.6 and later expect content_transformer(tolower) here)
my.corpus <- tm_map(my.corpus, removeWords, stopwords("english")) # removes stopwords
my.corpus <- tm_map(my.corpus, stemDocument) # removes suffixes from words to reduce them to a common stem
my.corpus.matrix <- TermDocumentMatrix(my.corpus) # creates a term-document matrix
mat.my.corpus <- as.matrix(my.corpus.matrix) # converts it to an ordinary matrix
my.corpus.df <- as.data.frame(mat.my.corpus) # creates a data frame listing every term that occurs in either of the 2 documents
my.corpus.df[200:250,1:2] # view some of the terms
copy.my.corpus.df <- my.corpus.df # make a copy of my.corpus.df to keep the original data frame for later
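For readers unfamiliar with term-document matrices, the structure produced above can be imitated in base R on two toy documents. This is only an illustration of the idea (one row per term, one column per document, cells holding counts), not how tm builds its matrix internally; docs, vocab and toy.df are hypothetical names.

```r
# Two toy "documents" standing in for the title and MeSH files.
docs <- c(title = "oncolytic virus therapy virus",
          mesh  = "virus neoplasms therapy")

# Split each document into terms on whitespace.
tokens <- strsplit(docs, "[[:space:]]+")

# Shared vocabulary across both documents, sorted alphabetically.
vocab <- sort(unique(unlist(tokens)))

# Count every vocabulary term in every document: rows are terms, columns are documents.
toy.tdm <- vapply(tokens,
                  function(tk) as.integer(table(factor(tk, levels = vocab))),
                  integer(length(vocab)))
rownames(toy.tdm) <- vocab

toy.df <- as.data.frame(toy.tdm)
toy.df
```

A term that is absent from a document gets a count of 0, which is what the later subsetting on "> 0" and "== 0" relies on.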
Text Mining Analysis
# Continue and type the following code in the R console:
# sort the data frame by the most frequent MeSH terms
my.corpus.df <- my.corpus.df[order(my.corpus.df$mesh_oncolytic_virus.txt, decreasing = T),]
# assign the 50 most frequent MeSH terms to xx
xx <- my.corpus.df[1:50,]
# view the top 5 most frequent MeSH terms; head(xx, 5) is equivalent
xx[1:5,]
# sort the 50 most frequent MeSH terms in increasing order (for plot visualization)
xx <- xx[order(xx$mesh_oncolytic_virus.txt, decreasing = FALSE),]
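The sort, truncate, re-sort pattern above can be checked on a small made-up data frame. In this sketch toy.df and its counts are invented, and the top 2 rows are kept instead of the top 50.

```r
toy.df <- data.frame(mesh  = c(3, 10, 0, 7),
                     title = c(5, 2, 8, 7),
                     row.names = c("tumor", "virus", "novel", "therapi"))

# Sort by MeSH count, highest first, and keep the top 2 rows.
top <- head(toy.df[order(toy.df$mesh, decreasing = TRUE), ], 2)

# Re-sort in increasing order, as done above before plotting.
top <- top[order(top$mesh), ]
rownames(top)
# [1] "therapi" "virus"
```

The increasing order matters only for the plot: after coord_flip() the most frequent term ends up at the top of the axis.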
Text Mining Analysis
# Continue and type the following code in the R console:
# Plot the 50 most frequent MeSH terms using the ggplot2 library
library(ggplot2)
Terms <- rownames(xx)
Mesh.count <- xx$mesh_oncolytic_virus.txt
ggplot(xx) + geom_point(aes(Terms, Mesh.count), stat = "identity", fill = "darkblue") + coord_flip() + theme_bw()
p1 <- last_plot() + scale_x_discrete(limits=(Terms))
p1
Text Mining Analysis
# Continue and type the following code in the R console:
# now select the most frequent title terms; therefore sort by title count in decreasing order
my.corpus.df <- my.corpus.df[order(my.corpus.df$title_oncolytic_virus.txt, decreasing = T),]
xy <- my.corpus.df[1:50,] # assign the 50 most frequent title terms to xy
xy[1:5,] # view the top 5 most frequent title terms
# sort the 50 most frequent title terms in increasing order (for plot visualization)
xy <- xy[order(xy$title_oncolytic_virus.txt, decreasing = FALSE),]
# Plot the 50 most frequent title terms
require(ggplot2)
Terms <- rownames(xy) # the terms must come from xy, the data frame being plotted
Title.count <- xy$title_oncolytic_virus.txt
ggplot(xy) + geom_point(aes(Terms, Title.count), stat = "identity", fill = "darkblue") + coord_flip() + theme_bw()
p2 <- last_plot() + scale_x_discrete(limits=(Terms))
p2
Text Mining Analysis
# Continue and type the following code in the R console:
# Create separate data frames for each frequency type
my.corpus.sub1.df <- subset(my.corpus.df, mesh_oncolytic_virus.txt>0 & title_oncolytic_virus.txt>0) # terms common to the 2 documents
my.corpus.sub1.df[200:300,1:2] # view some of the subset terms
my.corpus.sub2.df <- subset(my.corpus.df, mesh_oncolytic_virus.txt==0 & title_oncolytic_virus.txt>0) # terms present in title but not in MeSH
my.corpus.sub2.df[200:300,1:2] # view some of the terms (rows 200-300)
my.corpus.sub3.df <- subset(my.corpus.df, mesh_oncolytic_virus.txt>0 & title_oncolytic_virus.txt==0) # terms present in MeSH but not in title
my.corpus.sub3.df[200:300,1:2] # view some of the terms
# Correlate term counts in title and MeSH
cor(my.corpus.df$title_oncolytic_virus.txt, my.corpus.df$mesh_oncolytic_virus.txt)
# correlation coefficient is [1] 0.4442518
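The three subsets and the correlation above can be reproduced on a tiny made-up table to see which terms land where. In this sketch toy.df and its counts are invented; nrow() is shown because indexing rows 200:300 of a subset with fewer rows returns rows filled with NA.

```r
toy.df <- data.frame(mesh  = c(4, 0, 6, 2, 0),
                     title = c(5, 3, 0, 1, 2),
                     row.names = c("virus", "novel", "human", "tumor", "phase"))

both       <- subset(toy.df, mesh > 0 & title > 0)   # terms in both documents
title.only <- subset(toy.df, mesh == 0 & title > 0)  # in title but not in MeSH
mesh.only  <- subset(toy.df, mesh > 0 & title == 0)  # in MeSH but not in title

# Check how many rows each subset has before indexing into it.
sapply(list(both = both, title.only = title.only, mesh.only = mesh.only), nrow)

# Pearson correlation (the default of cor) between the two count columns.
r <- cor(toy.df$title, toy.df$mesh)
r
```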
Text Mining Analysis
# below generates a term frequency vector (counts summed over both documents) from the term-document matrix
termFrequency <- rowSums(as.matrix(my.corpus.matrix))
my.tdm <- TermDocumentMatrix(my.corpus, control = list(minWordLength = 1)) # recent tm versions use wordLengths = c(1, Inf) for the same purpose
my.tdm
# A term-document matrix (2632 terms, 2 documents)
# below selects those terms from the term-document matrix which occur at least 100 times
findFreqTerms(my.tdm[,1], lowfreq=100)
findFreqTerms(my.tdm[,2], lowfreq=100)
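What the frequency threshold does can be illustrated in base R on a small count matrix. This sketch mimics the idea of selecting terms with a count of at least lowfreq in one document; frequent_terms is a hypothetical helper, not the tm implementation of findFreqTerms, and the counts are made up.

```r
# Toy term-document counts: rows are terms, columns are documents.
toy.tdm <- matrix(c(120, 30, 5,
                    150, 101, 2),
                  nrow = 3,
                  dimnames = list(Terms = c("virus", "tumor", "novel"),
                                  Docs  = c("mesh", "title")))

# Overall term frequency across both documents.
termFrequency <- rowSums(toy.tdm)

# Hypothetical helper: names of the terms whose count reaches lowfreq.
frequent_terms <- function(counts, lowfreq) names(counts)[counts >= lowfreq]

frequent_terms(toy.tdm[, "mesh"], 100)
# [1] "virus"
frequent_terms(toy.tdm[, "title"], 100)
# [1] "virus" "tumor"
```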
Text Mining Analysis
# Code for plot 3: most frequent title terms with the corresponding MeSH terms
my.corpus.df <- my.corpus.df[order(my.corpus.df$title_oncolytic_virus.txt, decreasing = T),]
xy <- my.corpus.df[1:50,] # assign the 50 most frequent title terms to xy
# sort the 50 most frequent title terms in increasing order (for plot visualization)
xy <- xy[order(xy$title_oncolytic_virus.txt, decreasing = FALSE),]
# Plot the 50 most frequent title terms together with the MeSH counts of those same terms
Terms <- rownames(xy) # the terms must come from xy, the data frame being plotted
Title.count <- xy$title_oncolytic_virus.txt
Mesh.count <- xy$mesh_oncolytic_virus.txt
ggplot(xy, aes(Terms)) + geom_point(aes(y = Mesh.count, colour = "Mesh.count")) + geom_point(aes(y = Title.count, colour = "Title.count"))
p3 <- last_plot() + coord_flip()
p3 <- last_plot() + scale_x_discrete(limits=(Terms))
p3
Text Mining Analysis
# Code for plot 4: most frequent title terms and most frequent MeSH terms
top50.mh.ti <- rbind(xx,xy) # combine top 50 MeSH and title terms
Terms <- rownames(top50.mh.ti) # assign rownames to Terms
msh <- top50.mh.ti$mesh_oncolytic_virus.txt
titl <- top50.mh.ti$title_oncolytic_virus.txt
p4 <- ggplot(top50.mh.ti)
p4 <- p4 + geom_text(aes(x = msh, y = titl, label = Terms))
p4
Text Mining Analysis
# plot 5: most frequent title terms and most frequent MeSH terms
my.corpus.df <- my.corpus.df[order(my.corpus.df$title_oncolytic_virus.txt, decreasing = T),]
xy <- my.corpus.df[1:50,] # assign the 50 most frequent title terms to xy
xy[1:5,] # view the top 5 most frequent title terms
# sort the 50 most frequent title terms in increasing order (for plot visualization)
xy <- xy[order(xy$title_oncolytic_virus.txt, decreasing = FALSE),]
top50.mh.ti <- rbind(xx,xy) # combine top 50 MeSH and title terms
top50.mh.ti$Term <- rownames(top50.mh.ti)
rownames(top50.mh.ti) <- NULL # drop the row names now that the terms are stored in the Term column
colnames(top50.mh.ti)[1] <- "msh" # change col name
colnames(top50.mh.ti)[2] <- "title" # change col name
Text Mining Analysis
# plot 5: most frequent title terms and most frequent MeSH terms
library("reshape2")
# reshape2 is used to transform wide-format data by means of the melt function. melt takes data in wide format and stacks a set of columns into a single column of data.
top50.melt <- melt(top50.mh.ti, measure.vars = c("title", "msh"))
top50.melt
p <- ggplot(top50.melt, aes(Term, value, colour = variable)) + geom_point() + coord_flip()
p5 <- last_plot() + scale_x_discrete(limits=unique(top50.melt$Term)) # unique() because each term appears once per variable after melting
p5
Reference for reshape package: Hadley Wickham (2007). Reshaping Data with the reshape Package. Journal of Statistical Software, 21(12), 1-20. URL http://www.jstatsoft.org/v21/i12/.
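To make the wide-to-long step concrete, the stacking that melt performs can be imitated in base R on a two-row table. This is a sketch of the idea only, not the reshape2 implementation: toy.wide is made-up data, and here the variable column is a character vector, whereas melt is expected to return it as a factor.

```r
# Toy wide-format table: one row per term, one count column per document.
toy.wide <- data.frame(msh = c(10, 0), title = c(4, 7),
                       Term = c("virus", "novel"),
                       stringsAsFactors = FALSE)

measure.vars <- c("title", "msh")

# Stack the two count columns into one value column, keeping track of
# which column each value came from and which term it belongs to.
toy.long <- data.frame(
  Term     = rep(toy.wide$Term, times = length(measure.vars)),
  variable = rep(measure.vars, each = nrow(toy.wide)),
  value    = unlist(toy.wide[measure.vars], use.names = FALSE),
  stringsAsFactors = FALSE
)
toy.long
```

The long table has one row per term and variable, which is the shape ggplot needs in order to colour the points by variable.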
Text Mining Analysis
# plot 5: most frequent title terms and most frequent MeSH terms
p5 <- ggplot(top50.melt, aes(Term, value, colour = variable)) + geom_point() + coord_flip()
p5