An Environment for Data Analysis in Biomedical Domain:
Information Extraction for Decision Support Systems
Pablo Matos, Leonardo Lombardi, Thiago Pardo, Cristina Ciferri, Marina Vieira, and Ricardo Ciferri
presented by Thiago Pardo
USP NLP Group and UFSCar Database Group, São Carlos, BR
An Environment for Data Analysis - IEA-AIE2010
http://gbd.dc.ufscar.br
Context and Motivation
Many electronic documents report experiments, describing:
  the treatment adopted
  patients with some kind of disease
  the number of patients enrolled in the treatment
  symptoms and risk factors
  positive and negative effects
There are several transactions and journals, e.g., American Journal of Hematology, Blood, and Haematologica
06/02/10
Nowadays, researchers and doctors are unable to process this huge number of documents
These documents are in unstructured format, i.e., in plain textual form, especially in PDF
It is necessary to transform these data from unstructured to structured format in order to submit them to an automatic knowledge discovery process
Goal
Development of an environment called IEDSS-Bio for analyzing data of the biomedical domain, specifically Sickle Cell Anemia
Support the expert in making decisions:
  Extracting relevant information from biomedical documents
  Storing the information in a data warehouse (DW)
  Mining interesting knowledge from the DW
Contributions
Theoretical:
  Domain knowledge
  Methodology of information extraction
Practical:
  Resources: collection of documents, dictionary, and rules
  Tools: Converter, Information Extraction, Data Warehouse, and Data Mining systems
The Environment for Data Analysis
How many patients had clinical improvement and were treated with the hydroxyurea drug?
A significant number of patients under treatment with the hydroxyurea drug tend to have marrow depression.
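The question above is the kind of analytical query a data warehouse is meant to answer. The following is a minimal sketch only, assuming a hypothetical one-table schema (`treatment_fact` with columns `patient_id`, `drug`, and `clinical_improvement`) and made-up rows; the slides do not describe the actual DW schema.

```python
import sqlite3

def count_improved(conn, drug):
    """Count distinct patients treated with `drug` who showed clinical improvement."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT patient_id) FROM treatment_fact "
        "WHERE drug = ? AND clinical_improvement = 1",
        (drug,),
    ).fetchone()
    return row[0]

# Hypothetical schema and toy rows, for illustration only (not data from the paper).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE treatment_fact ("
    "patient_id INTEGER, drug TEXT, clinical_improvement INTEGER)"
)
conn.executemany(
    "INSERT INTO treatment_fact VALUES (?, ?, ?)",
    [
        (1, "hydroxyurea", 1),
        (2, "hydroxyurea", 0),
        (3, "hydroxyurea", 1),
        (4, "other", 1),
    ],
)

improved = count_improved(conn, "hydroxyurea")
print(improved)
```

A real warehouse would more likely use a star schema with separate dimension tables; a single table keeps the sketch short.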
Converter Module
[Figure: Converter Module, shown over two slides]
Information Extraction Module
Processed sections:
  Abstract, Results and Discussion (class of positive and negative effects)
  All sections (class of patient)
Sentence Classification
[Figure: sentence classification with ML techniques]
Training: several files about complication sentences, several files about benefit sentences, and several files about other sentences
Classes: Positive Effect, Negative Effect, Others
Test: new text (TXT)
Output: set of sentences classified into classes
Identification of Relevant Information
[Figure: Dictionary, Biomedical Database]
[Figure: Identification of Information Pipeline, with panels Example of Sentences, Rules, and Relevant Information]
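The slides name a dictionary and a set of rules as the resources for identifying relevant information, but the extracted text does not show them. As a rough sketch of how the two could be combined, the example below uses a hypothetical three-entry dictionary, one hypothetical regular-expression rule for the number of patients, and a made-up example sentence; none of these come from the original work.

```python
import re

# Hypothetical dictionary: term -> kind of information (not the authors' dictionary).
DICTIONARY = {
    "hydroxyurea": "treatment",
    "marrow depression": "negative_effect",
    "clinical improvement": "positive_effect",
}

# Hypothetical rule: a number immediately followed by the word "patients".
PATIENT_COUNT_RULE = re.compile(r"\b(\d+)\s+patients\b", re.IGNORECASE)

def extract_information(sentence):
    """Return (label, text) pairs found by dictionary lookup, then by the rule."""
    found = []
    lowered = sentence.lower()
    for term, label in DICTIONARY.items():
        if term in lowered:
            found.append((label, term))
    for match in PATIENT_COUNT_RULE.finditer(sentence):
        found.append(("number_of_patients", match.group(1)))
    return found

# Made-up sentence, for illustration only.
example = "Of the 25 patients treated with hydroxyurea, some developed marrow depression."
result = extract_information(example)
print(result)
```

Plain substring matching is the simplest possible lookup; a real system would need tokenization and handling of term variants.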
Experiments: Sentence Classification
1. How do human beings manually perform the sentence classification?
2. Is it feasible to automate the sentence classification task?
3. What kind of classification algorithm performs better in this task?
Question 1: Manual classification by humans
Annotation agreement on 50 sentences, measured with the Kappa statistic of Fleiss (1971):
$$K = \frac{P(A) - P(E)}{1 - P(E)}$$
where P(A) is the observed agreement among annotators and P(E) is the agreement expected by chance.
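To make the agreement measure concrete, here is a small sketch of Fleiss' kappa computed from a table of annotation counts. The function name and the toy table (4 sentences, 3 annotators, 3 classes) are illustrative assumptions, not the annotation data from the experiment.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa, where counts[i][j] is the number of annotators who
    assigned sentence i to class j. Every sentence must have the same
    number of annotators. Not defined when P(E) equals 1."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_classes = len(counts[0])
    # P(A): observed agreement, averaged over sentences.
    p_a = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # P(E): chance agreement, from the overall proportion of each class.
    totals = [sum(row[j] for row in counts) for j in range(n_classes)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_a - p_e) / (1 - p_e)

# Toy table, for illustration only: three unanimous sentences and one 2-to-1 split.
toy_counts = [
    [3, 0, 0],
    [0, 3, 0],
    [0, 0, 3],
    [2, 1, 0],
]
kappa = fleiss_kappa(toy_counts)
print(round(kappa, 3))
```

For this toy table P(A) = 5/6 and P(E) = 25/72, so K = 35/47.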
Question 2: Is it feasible to automate this task?
Annotators                  Kappa (all classes)
3 experts                   0.63
3 naïve subjects            0.71
experts + naïve subjects    0.65
Agreement scale (Landis and Koch, 1977):
Poor              below 0
Slight            0 to 0.20
Fair              0.21 to 0.40
Moderate          0.41 to 0.60
Substantial       0.61 to 0.80
Almost perfect    0.81 to 1
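Reading the three reported values against the Landis and Koch scale can be written as a small lookup. This is a sketch; the function name is hypothetical, and it assumes each band includes its upper bound, so a value such as 0.205 falls in the next band up.

```python
def agreement_level(kappa):
    """Map a kappa value to its Landis and Koch (1977) agreement label."""
    if kappa < 0:
        return "Poor"
    if kappa <= 0.20:
        return "Slight"
    if kappa <= 0.40:
        return "Fair"
    if kappa <= 0.60:
        return "Moderate"
    if kappa <= 0.80:
        return "Substantial"
    return "Almost Perfect"

# Kappa values as reported on the slide.
reported = {
    "3 experts": 0.63,
    "3 naive subjects": 0.71,
    "experts + naive subjects": 0.65,
}
levels = {who: agreement_level(k) for who, k in reported.items()}
print(levels)
```

All three groups fall in the Substantial band (0.61 to 0.80).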
Question 3: What kind of classification algorithm performs better in this task?
[Figure: Distribution of classes for each sample]
Question 3: Sentence classification process, training and testing phase
Bag-of-words model
AVM configuration:
  Minimum frequency = 2
  Attributes: 1- to 3-grams
  Attribute value: 1 if the n-gram occurs in the sentence (present); 0 otherwise (absent)
  Not applied: stopword removal and stemming
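The configuration above can be sketched in a few lines. The example below reads AVM as an attribute-value matrix, tokenizes by lowercasing and splitting on whitespace, and counts the minimum frequency over the whole collection; all three are assumptions of this sketch, and the toy sentences are made up.

```python
from collections import Counter

def ngrams(tokens, max_n=3):
    """All 1- to max_n-grams of a token list, as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]

def build_avm(sentences, min_freq=2, max_n=3):
    """Return (attributes, rows); rows[i][j] is 1 if attribute j occurs in
    sentence i and 0 otherwise. No stopword removal, no stemming."""
    grams = [ngrams(s.lower().split(), max_n) for s in sentences]
    # Frequency of each n-gram over the whole collection.
    total = Counter(g for sentence_grams in grams for g in sentence_grams)
    attributes = sorted(g for g, count in total.items() if count >= min_freq)
    rows = [
        [1 if a in set(sentence_grams) else 0 for a in attributes]
        for sentence_grams in grams
    ]
    return attributes, rows

# Toy sentences, for illustration only.
toy = ["patients improved", "patients improved quickly", "no effect"]
attributes, rows = build_avm(toy)
print(attributes)
print(rows)
```

Only n-grams seen at least twice become attributes, so "quickly" and everything in the third sentence are dropped here.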
Question 3: Evaluation
Partitioning method: 10-fold cross-validation
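As a reminder of what 10-fold cross-validation does, here is a minimal partitioning sketch: the items are shuffled, split into 10 folds, and each fold serves once as the test set while the other nine form the training set. The function name, the seed, and the choice of 100 items are arbitrary; the slides do not say whether the folds were stratified by class, and this sketch does not stratify.

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(100, k=10))
for train, test in splits:
    print(len(train), len(test))
```

The reported accuracy would then be the average over the 10 test folds.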
Conclusions
The proposed environment, Information Extraction and Decision Support System in Biomedical domain (IEDSS-Bio), aims at being a general environment for mining relevant information in the biomedical domain
First experiments on sentence classification, a step of the whole process, gave very good results (95.9% accuracy) for papers about Sickle Cell Anemia (SCA)
The task of sentence classification in the SCA domain is well defined and can be automated
Future Work
Investigate the identification of treatment and symptom information in scientific papers
Extract the relevant sentence pieces for populating our databases using IE approaches, e.g., rule-based and dictionary-based
Investigate the use of parallel processing to speed up the more time-consuming tasks, e.g., the application of data mining algorithms and analytical query processing
Other biomedical areas may also benefit from our text mining approach
Questions?
References
ANTHONY, L.; LASHKIA, G. V. Mover: a machine learning tool to assist in the reading and writing of technical papers. IEEE Transactions on Professional Communication, v. 46, n. 3, p. 185-193, 2003.
FLEISS, J. L. Measuring nominal scale agreement among many raters. Psychological Bulletin, v. 76, n. 5, p. 378-382, 1971.
LANDIS, J. R.; KOCH, G. G. The measurement of observer agreement for categorical data. Biometrics, v. 33, n. 1, p. 159-174, 1977.
PINTO, A. C. S. et al. Technical Report "Sickle Cell Anemia". São Carlos: Department of Computer Science, Federal University of São Carlos, 2009. p. 16. Available at: <http://sca.dc.ufscar.br/download/files/report.sca.pdf>.