Samuel O’[email protected]
Supervisor: Prof. Jiuyong [email protected]
Associate Supervisor: Dr. Jixue [email protected]
Information Retrieval of microRNA Research from Biomedical Literature
Motivation Background Research Question Contribution Implementation References
Do not remove this notice.
Copyright Notice
COMMONWEALTH OF AUSTRALIA
Copyright Regulations 1969
WARNING
This material has been produced and communicated to you by or on behalf of the University of South Australia pursuant to Part VB of the
Copyright Act 1968 (the Act).
The material in this communication may be subject to copyright under the Act. Any further reproduction or communication of this material by you
may be the subject of copyright protection under the Act.
Do not remove this notice.
Overview
DO NOT REMOVE THIS NOTICE. Reproduced and communicated on behalf of the University of South Australia pursuant to Part VB of the copyright Act 1968 (the Act) or with permission of the copyright owner on (DATE) Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. DO NOT REMOVE THIS NOTICE.
Motivation
Background
Research Question
Contribution
Implementation Examples
References
Motivation
microRNA research is increasing exponentially
Databases cannot be curated fast enough
A researcher cannot be “current” in the field of microRNA
Automatic curation tools exist for other areas of biomedical research
microRNA – What are they?
microRNAs are small, non-coding RNA molecules
They inhibit the production of proteins
Video from rossettagenomics.com
miRBase
A database of microRNA sequences and annotations.
Human microRNA 150 is also called MIR150, hsa-mir-150, MIRN150 etc.
miRBase provides the human readable name as well as a machine readable ID
Example: hsa-mir-150 has an ID of MI0000479 and HGNC:MIR150
A. Kozomara and S. Griffiths-Jones, “miRBase: integrating microRNA annotation and deep-sequencing data”, Nucleic Acids Research, vol. 39, no. suppl 1, pp. D152-D157, 2011.
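The alias problem on this slide can be illustrated with a simple lookup. This is a minimal sketch, assuming a small hand-built alias table: `ALIASES` and `normalise_name` are hypothetical names, and only the aliases and the ID MI0000479 come from the slide. Lower-casing is a simplification, since miRBase itself distinguishes some names by case.

```python
# Hypothetical alias table mapping lower-cased aliases to a miRBase ID.
# Only the hsa-mir-150 entry comes from the slide; a real table would be
# built from the miRBase download files.
ALIASES = {
    "mir150": "MI0000479",
    "hsa-mir-150": "MI0000479",
    "mirn150": "MI0000479",
}


def normalise_name(name):
    """Return the miRBase ID for a known alias, or None if it is unknown."""
    return ALIASES.get(name.strip().lower())


if __name__ == "__main__":
    for alias in ("MIR150", "hsa-mir-150", "MIRN150", "MIR999"):
        print(alias, "->", normalise_name(alias))
```

Returning `None` for an unknown alias leaves the caller free to flag the mention for manual review rather than guess an ID.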
Disease Related Enzymes
Finds occurrences of an enzyme and a disease mentioned in the same sentence
Classifies their relationship using a Support Vector Machine
Uses a training set of pre-classified sentences
Example: “Chronic granulomatous disease (CGD) results from mutations of phagocyte NADPH oxidase.” Classified as “Causal Interaction”
C. Sohngen, A. Chang, and D. Schomburg, “Development of a classification scheme for disease-related enzyme information”, BMC Bioinformatics, vol. 12, no. 1, p. 329, 2011.
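The first step described above, finding an enzyme and a disease in the same sentence, might look roughly like the following. This is only a sketch: the term lists, the naive sentence splitter and the function name `cooccurring_sentences` are assumptions for illustration, and the Support Vector Machine classification step is not shown.

```python
import re

# Hypothetical term lists; a real system would use curated dictionaries.
ENZYMES = ["NADPH oxidase"]
DISEASES = ["chronic granulomatous disease"]


def cooccurring_sentences(text):
    """Return (sentence, enzyme, disease) triples found within one sentence."""
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hits = []
    for sentence in sentences:
        lowered = sentence.lower()
        for enzyme in ENZYMES:
            for disease in DISEASES:
                if enzyme.lower() in lowered and disease.lower() in lowered:
                    hits.append((sentence, enzyme, disease))
    return hits
```

Each returned sentence would then be passed to the classifier together with the pre-classified training sentences.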
Gene Name Disambiguation
Genes can have many different names or variations
Humans can understand “context”; for machines this is a challenge
Example: Five sentences in the paper refer to different genes. Four of these refer to a human gene, but the fifth is ambiguous: it could be a human gene or a fly gene.
C. J. Sun, X. L. Wang, L. Lin, and Y.-C. Liu, “A multi-level disambiguation framework for gene name normalization”, Acta Automatica Sinica, vol. 35, no. 2, pp. 193-197, 2009.
LINNAEUS – Species Identification
LINNAEUS uses a set of simple regular expressions to find indicators of what species a text is referring to.
In my research I use a modified list to incorporate the specific microRNA domain knowledge.
Example: these words can all be used when talking about humans (ID: 9606):
[hH]umans? [pP]atients? [pP]articipants? [wW]oman [wW]omen [mM]en [gG]irls? [bB]oys? [pP]eoples? [Cc]hild(ren)? [Ii]nfants? [Pp]ersons?
Gerner, M, Nenadic, G & Bergman, C 2010, 'LINNAEUS: A species name identification system for biomedical literature', BMC Bioinformatics, vol. 11, no. 1, p. 85.
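The indicator patterns above can be combined into a single expression. The following is a minimal sketch rather than LINNAEUS itself: the word-boundary anchors and the function name `find_species_mentions` are assumptions, and only the human patterns from the slide are included.

```python
import re

# Patterns copied from the slide; 9606 is the ID the slide gives for humans.
HUMAN_PATTERNS = [
    r"[hH]umans?", r"[pP]atients?", r"[pP]articipants?", r"[wW]oman",
    r"[wW]omen", r"[mM]en", r"[gG]irls?", r"[bB]oys?", r"[pP]eoples?",
    r"[Cc]hild(ren)?", r"[Ii]nfants?", r"[Pp]ersons?",
]
# Assumption: \b anchors stop "men" from matching inside "treatment".
HUMAN_RE = re.compile(r"\b(?:" + "|".join(HUMAN_PATTERNS) + r")\b")


def find_species_mentions(text):
    """Return (species_id, matched_word, character_offset) per indicator."""
    return [(9606, m.group(0), m.start()) for m in HUMAN_RE.finditer(text)]
```

Other species would each get their own pattern list and ID, and the offsets could be converted to sentence and word numbers for storage.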
Research Question
What is the most suitable technique for discovering and classifying microRNA-gene relationships from biomedical literature?
Contribution
1. A normalisation and disambiguation technique for gene names will be adapted to fit the unique microRNA ontology.
2. Automatic curation of microRNA and gene relationships in biomedical literature. (Not completed yet)
MySQL Database Backend
Table Name          Columns
Abstracts           ID, Abstract, Title
Stop_Abstracts      ID, Abstract, Title
Species             ID, Name
Micro_Prefix        Prefix, Species_ID
Species_Mentions    Abstract_ID, Species_ID, Sentence_Num, Word_Num
MicroRNA_Mentions   Abstract_ID, Micro_ID, Sentence_Num, Word_Num
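The table layout above could be created along the following lines. This is a sketch under stated assumptions: the column types and keys are guesses, and SQLite (via Python's standard `sqlite3` module) stands in for MySQL so the example runs without a server. The helper names `create_database` and `mentions_for` are hypothetical.

```python
import sqlite3

# Table and column names follow the slide; types and keys are assumptions.
SCHEMA = """
CREATE TABLE Abstracts (ID INTEGER PRIMARY KEY, Abstract TEXT, Title TEXT);
CREATE TABLE Stop_Abstracts (ID INTEGER PRIMARY KEY, Abstract TEXT, Title TEXT);
CREATE TABLE Species (ID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Micro_Prefix (Prefix TEXT, Species_ID INTEGER);
CREATE TABLE Species_Mentions (
    Abstract_ID INTEGER, Species_ID INTEGER,
    Sentence_Num INTEGER, Word_Num INTEGER);
CREATE TABLE MicroRNA_Mentions (
    Abstract_ID INTEGER, Micro_ID TEXT,
    Sentence_Num INTEGER, Word_Num INTEGER);
"""


def create_database():
    """Create an in-memory database containing the six tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def mentions_for(conn, abstract_id):
    """Return (Micro_ID, Sentence_Num, Word_Num) rows ordered by word position."""
    cur = conn.execute(
        "SELECT Micro_ID, Sentence_Num, Word_Num FROM MicroRNA_Mentions "
        "WHERE Abstract_ID = ? ORDER BY Word_Num",
        (abstract_id,),
    )
    return cur.fetchall()
```

Storing the sentence and word number with every mention is what allows species and microRNA mentions to be compared by position later on.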
Full Example – Original Abstract
microRNA profiling in Epstein-Barr virus-associated B-cell lymphoma.
The Epstein-Barr virus (EBV) is an oncogenic human Herpes virus found in ~15% of diffuse large B-cell lymphoma (DLBCL). EBV encodes miRNAs and induces changes in the cellular miRNA profile of infected cells. MiRNAs are small, non-coding RNAs of ~19-26 nt which suppress protein synthesis by inducing translational arrest or mRNA degradation. Here, we report a comprehensive miRNA-profiling study and show that hsa-miR-424, -223, -199a-3p, -199a-5p, -27b, -378, -26b, -23a, -23b were upregulated and hsa-miR-155, -20b, -221, -151-3p, -222, -29b/c, -106a were downregulated more than 2-fold due to EBV-infection of DLBCL. All known EBV miRNAs with the exception of the BHRF1 cluster as well as EBV-miR-BART15 and -20 were present. A computational analysis indicated potential targets such as c-MYB, LATS2, c-SKI and SIAH1. We show that c-MYB is targeted by miR-155 and miR-424, that the tumor suppressor SIAH1 is targeted by miR-424, and that c-SKI is potentially regulated by miR-155. Downregulation of SIAH1 protein in DLBCL was demonstrated by immunohistochemistry. The inhibition of SIAH1 is in line with the notion that EBV impedes various pro-apoptotic pathways during tumorigenesis. The down-modulation of the oncogenic c-MYB protein, although counter-intuitive, might be explained by its tight regulation in developmental processes.
Full Example – Stopwords Removed
Epstein-Barr virus EBV oncogenic human Herpes virus found 15 diffuse large B-cell lymphoma DLBCL
…
MiRNAs small non-coding RNAs 19-26 nt suppress protein synthesis inducing translational arrest mRNA degradation . we report comprehensive miRNA-profiling study show hsa-miR-424 223 199a-3p 199a-5p 27b 378 26b 23a 23b upregulated hsa-miR-155 20b 221 151-3p 222 29b c 106a downregulated 2-fold due EBV-infection DLBCL
…
Full Example – Stopwords Removed
First replace all sentence-ending full stops with “ . ” and remove the final full stop:
◦ $abstract =~ s/([^\s])\.\s+/$1 . /gm;
◦ $abstract =~ s/([^\s])\.\s*\Z/$1/gm;
◦ “Ph.D” will not be affected by this
Then split the words into the following chunks:
◦ $abstract =~ /(([a-zA-Z0-9']+-)*[a-zA-Z0-9'\.]+)/g
◦ And remove the word if it matches Lingua’s stopword list (James 2002).
◦ Essentially this algorithm splits each word up but still keeps hyphens, apostrophes and numbers.
◦ Most stopword algorithms remove numbers and hyphens but they are essential for microRNA detection.
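The steps above can be sketched as a short Python port of the Perl patterns. This is illustrative only: `STOPWORDS` is a small stand-in set and not Lingua's actual list, and the function name `remove_stopwords` is hypothetical.

```python
import re

# Stand-in stopword set for illustration; not Lingua's real list.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "that", "the", "were"}

# Port of the slide's chunk pattern: keeps hyphens, apostrophes, digits and
# full stops inside a token.
TOKEN_RE = re.compile(r"(?:[a-zA-Z0-9']+-)*[a-zA-Z0-9'.]+")


def remove_stopwords(abstract):
    """Tokenise an abstract and drop stopwords, keeping ' . ' sentence marks."""
    # A full stop followed by whitespace becomes a separate " . " token.
    abstract = re.sub(r"(\S)\.\s+", r"\1 . ", abstract)
    # The final full stop of the abstract is removed.
    abstract = re.sub(r"(\S)\.\s*\Z", r"\1", abstract)
    tokens = TOKEN_RE.findall(abstract)
    return [t for t in tokens if t.lower() not in STOPWORDS]
```

Because a full stop is only split off when whitespace follows it, a token such as "Ph.D" stays intact, and the standalone "." tokens preserve sentence boundaries for later position counting.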
Full Example – Analysis
These two lines from the text specify 17 different microRNAs:
hsa-miR-424 223 199a-3p 199a-5p 27b 378 26b 23a 23b
hsa-miR-155 20b 221 151-3p 222 29b c 106a
The “hsa-” prefix confirms that this is a human sequence.
If there are competing species in the same document, we use a distance function to choose one and keep the others as backups.
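The slide does not define the distance function, so the following is only one possible reading. It assumes that distance is measured in sentences first and words second, and the names `rank_species` and `sentence_weight` are hypothetical. The first ID returned is the chosen species; the rest are the backups.

```python
def rank_species(mirna_pos, species_mentions, sentence_weight=100):
    """Rank candidate species IDs by closeness to a microRNA mention.

    mirna_pos is (sentence_num, word_num) of the microRNA mention.
    species_mentions is a list of (species_id, sentence_num, word_num).
    """
    best = {}
    for species_id, sentence, word in species_mentions:
        # Assumed metric: sentence gap dominates, word gap breaks ties.
        distance = (abs(sentence - mirna_pos[0]) * sentence_weight
                    + abs(word - mirna_pos[1]))
        if species_id not in best or distance < best[species_id]:
            best[species_id] = distance
    # Closest species first; ties are broken by ID for a stable order.
    return sorted(best, key=lambda sid: (best[sid], sid))
```

For instance, with a microRNA at sentence 3, word 6, a species mentioned later in the same sentence outranks one mentioned in the neighbouring sentences.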
Full Example – Detection
This regular expression captures all microRNA written in the standard format:
◦ m/^((([a-zA-Z]+-)?(mir|let)-?)[\d][\d\-a-z]*$)/mi
For example:
◦ hsa-miR-27b
◦ hsa-miR-29b-1
◦ let-7b
◦ MIR298A
It does not fully capture the following string:
◦ hsa-miR-424 -223
◦ It would only see the first microRNA, but miss 223
◦ My algorithm appends each number to the last seen microRNA prefix if the number occurs immediately after a valid microRNA
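The prefix-carrying idea above can be sketched in Python as follows. This is an interpretation, not the original implementation: it assumes that an appended number itself counts as a valid microRNA, so runs such as "hsa-miR-424 223 199a-3p" are expanded in full. `SUFFIX_RE` and `detect_mirnas` are hypothetical names, and shorthand such as "29b c" is not handled.

```python
import re

# Python port of the slide's pattern; group 2 is the prefix, e.g. "hsa-miR-".
MIRNA_RE = re.compile(
    r"^((([a-zA-Z]+-)?(mir|let)-?)[\d][\d\-a-z]*$)", re.I | re.M)
# Assumed shape of a bare suffix such as "223" or "199a-3p".
SUFFIX_RE = re.compile(r"^\d[\d\-a-z]*$", re.I)


def detect_mirnas(tokens):
    """Return microRNA names, expanding bare numbers with the last prefix."""
    found = []
    prefix = None  # prefix of the previous microRNA while the run is unbroken
    for token in tokens:
        match = MIRNA_RE.match(token)
        if match:
            prefix = match.group(2)
            found.append(token)
        elif prefix is not None and SUFFIX_RE.match(token):
            found.append(prefix + token)
        else:
            prefix = None  # any other word ends the run
    return found
```

Resetting the prefix on any non-matching word keeps unrelated numbers later in the sentence, such as "2-fold", from being read as microRNAs.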
Full Example – Real Detection
Abstract_ID Micro_ID Sentence Word Micro_Name
21062812 MI0000079 3 13 hsa-mir-23a
21062812 MI0000084 3 12 hsa-mir-26b
21062812 MI0000298 3 18 hsa-mir-221
21062812 MI0000299 3 20 hsa-mir-222
21062812 MI0000300 3 7 hsa-mir-223
21062812 MI0000439 3 14 hsa-mir-23b
21062812 MI0000440 3 10 hsa-mir-27b
21062812 MI0000113 3 11 hsa-mir-106a
21062812 MI0000681 3 16 hsa-mir-155
21062812 MI0001446 3 6 hsa-mir-424
21062812 MI0000105 3 8 hsa-mir-29b-1
21062812 MI0000105 3 8 hsa-mir-29b-2
21062812 MI0000735 3 9 hsa-mir-29c
21062812 MI0001519 3 17 hsa-mir-20b
Missing Entries:
mir-199a-3p: New Terminology
mir-199a-5p: New Terminology
mir-378: Ambiguous Entries
mir-151-3p: New Terminology
Full Example – Review
To review the effectiveness of this algorithm:
1. We will manually annotate a random selection of abstracts with correct microRNA information.
   Pros: Accurate, wide selection of different types of writing
   Cons: Slow and laborious
2. We will do a reverse lookup from miRBase (which references PubMed IDs) and assume that the referenced abstracts contain the microRNA from miRBase.
   Pros: Fast and automated
   Cons: The microRNA might not be mentioned at all in the abstract (false negatives). The microRNAs are likely to be specified with their fully qualified names and so may not fully represent the target population.
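Either review approach yields a reference set of microRNA mentions to compare against the algorithm's detections. The slides do not name a metric, so the following sketch assumes precision and recall; the function name `precision_recall` is hypothetical.

```python
def precision_recall(predicted, gold):
    """Compare detected mentions against reference mentions.

    Both arguments are iterables of hashable items, for example
    (abstract_id, micro_id) pairs.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

With the manual annotation the reference set is trusted in both directions. With the miRBase reverse lookup, a microRNA that is absent from the abstract still appears in the reference set, which lowers recall without reflecting a real detection failure.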
Some Statistics
There are 18,314 entries in my Abstracts table
◦ Of those, there are 17,231 with usable abstracts
48% of these abstracts contain species indicators.
When the abstracts finished downloading (after 2 hours) there were already 16 new abstracts available.
My database has 21,222 unique microRNA listed from miRBase.
There are 62,036 microRNA with no ambiguity in the abstracts.
53% of total detections were improved by the species detection.
References
J. Imig, N. Motsch, J. Y. Zhu, S. Barth, M. Okoniewski, T. Reineke, M. Tinguely, A. Faggioni, P. Trivedi, G. Meister, C. Renner, and F. A. Grasser, “microRNA profiling in Epstein-Barr virus-associated B-cell lymphoma”, Nucleic Acids Res, vol. 39, no. 5, pp. 1880-1893, 2011.
M. Gerner, G. Nenadic, and C. Bergman, “LINNAEUS: A species name identification system for biomedical literature”, BMC Bioinformatics, vol. 11, no. 1, p. 85, 2010.
L. J. Jensen, J. Saric, and P. Bork, “Literature mining for the biologist: from information retrieval to biological discovery”, Nat Rev Genet, vol. 7, no. 2, pp. 119-129, 2006.
A. Kozomara and S. Griffiths-Jones, “miRBase: integrating microRNA annotation and deep-sequencing data”, Nucleic Acids Research, vol. 39, no. suppl 1, pp. D152-D157, 2011.
C. Sohngen, A. Chang, and D. Schomburg, “Development of a classification scheme for disease-related enzyme information”, BMC Bioinformatics, vol. 12, no. 1, p. 329, 2011.
C. J. Sun, X. L. Wang, L. Lin, and Y.-C. Liu, “A multi-level disambiguation framework for gene name normalization”, Acta Automatica Sinica, vol. 35, no. 2, pp. 193-197, 2009.
H. C. Wang, Y. H. Chen, H. Y. Kao, and S. J. Tsai, “Inference of transcriptional regulatory network by bootstrapping patterns”, Bioinformatics (Oxford, England), vol. 27, no. 10, pp. 1422-1428, 2011.
Questions
Any Questions?