An Environment for Data Analysis in Biomedical Domain: Information Extraction for Decision Support Systems

Pablo Matos, Leonardo Lombardi, Thiago Pardo, Cristina Ciferri, Marina Vieira, and Ricardo Ciferri

presented by Thiago Pardo

USP NLP Group and UFSCar Database Group, São Carlos, BR



An Environment for Data Analysis - IEA-AIE 2010, 06/02/10

http://gbd.dc.ufscar.br

Context and Motivation

A lot of electronic documents report experiments, covering:

- treatment adopted
- patients with some kind of disease
- number of patients enrolled in the treatment
- symptoms and risk factors
- positive and negative effects

There are several transactions and journals, e.g., American Journal of Hematology, Blood, and Haematologica

Nowadays, researchers and doctors are not able to process this huge number of documents

These documents are in an unstructured format, i.e., in plain textual form, especially PDF

It is necessary to transform these data from an unstructured to a structured format in order to submit them to an automatic knowledge discovery process

Goal

Development of an environment called IEDSS-Bio for analyzing data from the biomedical domain, specifically Sickle Cell Anemia

Support the expert in making decisions by:

- Extracting relevant information from biomedical documents
- Storing the information in a data warehouse (DW)
- Mining interesting knowledge from the DW

Contributions

Theoretical:

- Domain knowledge
- Methodology of information extraction

Practical:

- Resources: collection of documents, dictionary, and rules
- Tools: Converter, Information Extraction, Data Warehouse, and Data Mining systems

The Environment for Data Analysis

Example question for the data warehouse: How many patients had clinical improvement and were treated with the hydroxyurea drug?

Example of mined knowledge: A significant number of patients under treatment with the hydroxyurea drug tend to have marrow depression.
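The example question above can be pictured as an aggregate query over extracted facts. The following is only a minimal sketch: the table name `treatment_fact`, its columns, and the rows in `TOY_ROWS` are hypothetical and made up for illustration. The slides do not show the actual data warehouse schema or any real counts.

```python
import sqlite3


def count_patients(rows, treatment, outcome):
    """Sum patient counts for one treatment/outcome pair in a toy fact table.

    The schema below is an assumption for illustration, not the authors' DW.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE treatment_fact ("
        "paper_id TEXT, treatment TEXT, outcome TEXT, n_patients INTEGER)"
    )
    con.executemany("INSERT INTO treatment_fact VALUES (?, ?, ?, ?)", rows)
    # COALESCE turns the NULL returned by SUM over zero rows into 0.
    (total,) = con.execute(
        "SELECT COALESCE(SUM(n_patients), 0) FROM treatment_fact "
        "WHERE treatment = ? AND outcome = ?",
        (treatment, outcome),
    ).fetchone()
    con.close()
    return total


# Made-up rows for illustration only; they are not results from any paper.
TOY_ROWS = [
    ("paper-1", "hydroxyurea", "clinical improvement", 12),
    ("paper-1", "hydroxyurea", "marrow depression", 3),
    ("paper-2", "hydroxyurea", "clinical improvement", 8),
    ("paper-2", "other treatment", "clinical improvement", 5),
]
```

Calling `count_patients(TOY_ROWS, "hydroxyurea", "clinical improvement")` answers the example question for the toy rows only.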

Converter Module


Information Extraction Module

Processed sections:

- Abstract, Results and Discussion (class of positive and negative effects)
- All sections (class of patient)

Sentence Classification

[Diagram: sentence classification process]

- Training: several files of complication sentences, several files of benefit sentences, and several files of other sentences, given to ML techniques
- Classes: Positive Effect, Negative Effect, Others
- Test: new text (TXT)
- Output: set of sentences classified into classes

Identification of Relevant Information

[Diagram components: Dictionary, Biomedical Database]

[Diagram: identification of information pipeline, with components Example of Sentences, Rules, and Relevant Information]
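To make the dictionary and rule components more concrete, here is a minimal sketch of how the two could work together on one sentence. Everything in it is an assumption: the dictionary entries in `EFFECT_TERMS` and the regular expression in `PATIENT_COUNT_RULE` are hypothetical examples, since the slides do not show the authors' actual dictionary or rules.

```python
import re

# Hypothetical dictionary entries mapping a term to a class label.
EFFECT_TERMS = {
    "marrow depression": "negative effect",
    "clinical improvement": "positive effect",
}

# Hypothetical rule: an integer immediately followed by the word "patients".
PATIENT_COUNT_RULE = re.compile(r"\b(\d+)\s+patients\b", re.IGNORECASE)


def extract(sentence):
    """Return dictionary matches and patient counts found in one sentence."""
    lowered = sentence.lower()
    # Dictionary-based step: plain substring lookup of each known term.
    effects = [
        (term, label) for term, label in EFFECT_TERMS.items() if term in lowered
    ]
    # Rule-based step: collect every number captured by the pattern.
    counts = [int(n) for n in PATIENT_COUNT_RULE.findall(sentence)]
    return {"effects": effects, "patient_counts": counts}
```

A substring lookup is the simplest possible matcher; a real system would likely need tokenization and handling of term variants.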


Experiments: Sentence Classification

1. How do human beings manually perform the sentence classification?

2. Is it feasible to automate the sentence classification task?

3. What kind of classification algorithm performs better in this task?


Manual classification by humans? (question 1)

Annotation agreement on 50 sentences, following Fleiss (1971):

\[
K = \frac{P(A) - P(E)}{1 - P(E)}
\]

where P(A) is the observed agreement among annotators and P(E) is the agreement expected by chance.
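The agreement formula can be computed directly from a table of rating counts. The sketch below follows the standard multi-rater kappa of Fleiss (1971). The input layout (one row per sentence, one column per class, each cell holding the number of annotators who chose that class) and the function name `fleiss_kappa` are assumptions for illustration. The slides do not show how the authors computed their values.

```python
def fleiss_kappa(counts):
    """Compute Fleiss' kappa from a matrix of rating counts.

    counts[i][j] is the number of annotators who assigned item i to class j.
    Every item must be rated by the same number of annotators (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_classes = len(counts[0])

    # P(A): mean over items of the proportion of agreeing annotator pairs.
    p_a = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # P(E): chance agreement from the overall share of each class.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_classes)
    ]
    p_e = sum(s * s for s in shares)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one class is ever used")

    return (p_a - p_e) / (1 - p_e)
```

For three annotators and the three classes used here, each row would hold three counts that sum to 3, for example `[2, 1, 0]`.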

Is it feasible to automate this task? (question 2)

Annotator                   All the classes
3 experts                   0.63
3 naïve subjects            0.71
experts + naïve subjects    0.65

Agreement scale (Landis and Koch, 1977):

Agreement         Range
Poor              under 0
Slight            0 to 0.20
Fair              0.21 to 0.40
Moderate          0.41 to 0.60
Substantial       0.61 to 0.80
Almost Perfect    0.81 to 1
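Reading an agreement value against the Landis and Koch (1977) scale is a simple threshold lookup. The sketch below assumes the reported agreement values are kappa values and uses a hypothetical function name, `agreement_level`.

```python
def agreement_level(kappa):
    """Map a kappa value to its label on the Landis and Koch (1977) scale."""
    if kappa < 0:
        return "Poor"
    if kappa <= 0.20:
        return "Slight"
    if kappa <= 0.40:
        return "Fair"
    if kappa <= 0.60:
        return "Moderate"
    if kappa <= 0.80:
        return "Substantial"
    return "Almost Perfect"
```

Under that assumption, all three reported values (0.63, 0.71, and 0.65) fall in the "Substantial" band.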

What kind of classification algorithm performs better in this task? (question 3)

[Table: distribution of classes for each sample]

Sentence Classification Process: training and testing phase

Bag-of-words model

AVM configuration:

- Minimum frequency = 2
- Attributes: 1- to 3-grams
- Attribute value: 1 if the n-gram occurs in the sentence (present); 0 otherwise (absent)
- Not considered: stopword removal and stemming
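A configuration like this can be sketched in a few lines. The code below builds binary presence features over 1- to 3-grams and keeps only n-grams that meet the minimum frequency, without stopword removal or stemming. Two details are assumptions, because the slides do not specify them: tokens are produced by lowercasing and splitting on whitespace, and the minimum frequency is counted as total occurrences across the whole collection. The names `ngrams` and `build_avm` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=3):
    """Return all n-grams of the token list for n_min <= n <= n_max."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_avm(sentences, min_freq=2):
    """Build a binary attribute-value table: one row per sentence.

    Returns (attributes, rows), where rows[i][j] is 1 if attribute j
    occurs in sentence i and 0 otherwise.
    """
    # Assumed tokenization: lowercase and split on whitespace.
    tokenized = [s.lower().split() for s in sentences]
    # Assumed frequency: total occurrences over all sentences.
    freq = Counter(g for toks in tokenized for g in ngrams(toks))
    attributes = sorted(g for g, c in freq.items() if c >= min_freq)
    rows = []
    for toks in tokenized:
        present = set(ngrams(toks))
        rows.append([1 if g in present else 0 for g in attributes])
    return attributes, rows
```

Each row can then be paired with its class label and passed to any classifier that accepts binary attributes.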

Evaluation

Partitioning method: 10-fold cross-validation
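As a reminder of what 10-fold cross-validation does, the sketch below splits sample indices into ten folds and yields each fold once as the test set, with the other nine as the training set. It is a plain, non-stratified split with a fixed shuffle seed. Both choices are assumptions, since the slides do not say whether the folds were stratified or which tool produced them. The name `k_fold_indices` is hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if n_samples < k:
        raise ValueError("need at least k samples")
    indices = list(range(n_samples))
    # Shuffle once with a fixed seed so the partition is reproducible.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

The reported score is then usually the mean of the per-fold scores, so every sentence is used for testing exactly once.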

Conclusions

The proposed environment, Information Extraction and Decision Support System in Biomedical domain, aims to be a general environment for mining relevant information in the biomedical domain

First experiments on sentence classification:

- a step of the whole process
- very good results (95.9% accuracy) for papers about Sickle Cell Anemia (SCA)

The task of sentence classification in the SCA domain is well defined and can be automated

Future Work

Investigate the identification of treatment and symptom information in scientific papers

Extract the relevant sentence pieces to populate our databases using IE approaches, e.g., rule-based and dictionary-based

Investigate the use of parallel processing to optimize the more time-consuming tasks, e.g., the application of data mining algorithms and analytical query processing

Other biomedical areas may also benefit from our text mining approach


Questions?

References

ANTHONY, L.; LASHKIA, G. V. Mover: a machine learning tool to assist in the reading and writing of technical papers. IEEE Transactions on Professional Communication, v. 46, n. 3, p. 185-193, 2003.

FLEISS, J. L. Measuring nominal scale agreement among many raters. Psychological Bulletin, v. 76, n. 5, p. 378-382, 1971.

LANDIS, J. R.; KOCH, G. G. The measurement of observer agreement for categorical data. Biometrics, v. 33, n. 1, p. 159-174, 1977.

PINTO, A. C. S. et al. Technical Report "Sickle Cell Anemia". São Carlos: Department of Computer Science, Federal University of São Carlos, 2009. p. 16. Available at: <http://sca.dc.ufscar.br/download/files/report.sca.pdf>.
