Datamining Project: Update
Marcus Gutierrez, Jessica Rebollosa, Vince (Wenqing) Sun, Lauren Waldrop
http://dataminingmed.weebly.com


Recap
• Data
  ▫ Non-homogenous datasets (Clinical Trial/Pubmed)
  ▫ Cancer-related
  ▫ Relations (Explicit links)
• Motivation
  ▫ Implicit links between clinical trials and pubmed articles may exist
• Aim
  ▫ Provide scientists in the biological community insight into related clinical trials and/or other publications of interest


Data Pre-Processing
• First trial terms:

1. if. radiation therapy
2. i),. gemcitabine/cisplatin
3. weeks until. disease progression
4. this. regimen
5. serum levels of. interleukin-6
6. biliary adenocarcinomas
7. adenocarcinoma treated
8. post-operative adjuvant paclitaxel + cisplatin
9. phase ii trial of post-operative
10. cardia receiving. post-operative adjuvant paclitaxel
11. gastro-esophageal junction or cardia


Data Pre-Processing
• LingPipe code gives terms such as:

Running the statistical named entity recognizer, with training of a named entity recognizer, using two models: pos-en-bio-medpost.HiddenMarkovModel and pos-en-general-brown.HiddenMarkovModel

1. brain metastases
2. patients undergo
3. prophylactic cranial irradiation
4. brain
5. disease small cell lung cancer
6. cranial irradiation
7. health economics
8. therapy vs progression
9. administration


Refer to: http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html


Data Pre-Processing
• Final terms:

Trials:

1. bronchi

2. bronchial

3. bsh

4. cachexia

5. calcimimetic

6. calcium

7. hybridization

8. hydrochloride

9. hydrocortisone

10. hydroxyproline

11. hypercortisolism


Refer to: https://github.com/tnunes/becas-python

Pubmed:

1. abdomen

2. acc

3. acetate

4. acitretin

5. actinomycin

6. add

7. dermatitis

8. desmoid

9. desmolase

10. desmoplastic

11. detoxification

Semantic group
Identified entity types:
➢ Chemicals
➢ Enzymes
➢ Genes
➢ Protein
➢ Disease or Syndrome
➢ Anatomical structure
➢ Body System


Entity Extraction - TFIDF
• Using textmodeler code
  ▫ Extract entities
  ▫ Calculate TFIDF
• Examples of Features:
  ▫ “thyroid cancer”
  ▫ “stem cell”
  ▫ “cell lung cancer tumor cells tumor cells” X
  ▫ “arms arm arm oxaliplatin arm arm” X
• Number of Unique Entities:
  ▫ Pubmed: 1696
  ▫ Trials: 1492
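The TFIDF calculation named above can be illustrated with a minimal sketch. It assumes the common definition tf = count / document length and idf = log(N / document frequency); the slides do not say which variant the textmodeler code uses, and the function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant (not confirmed by the slides):
      tf  = term count / document length
      idf = log(N / number of documents containing the term)
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy documents, not taken from the project data.
docs = [
    ["thyroid", "cancer", "trial"],
    ["stem", "cell", "trial"],
    ["thyroid", "cancer", "cancer", "cell"],
]
weights = tfidf(docs)
```

Each entity (or term) then becomes one feature column, with these weights as the values that the later variance analysis operates on.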


Term Extraction - TFIDF
• Implement Simple Code
  ▫ Term Extraction
  ▫ TFIDF Calculation
• Examples of Features:
  ▫ “mesothelioma”
  ▫ “adenocarcinoma”
  ▫ “neoplasia”
• Number of Unique Entities:
  ▫ Pubmed: 818
  ▫ Trials: 802


Results

For the variance analysis, we removed the maximum threshold and used only the minimum threshold, to see whether there are any improvements.


Results

We then chose thresholds of 0.00008 and 0.000070.

We noticed that the two ACS figures are very similar.


Results

We then removed the terms with variance lower than the threshold and obtained the clusters before dependent clustering (K=10). After dependent clustering, however, there is only one giant cluster.
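The variance-based term removal described above can be sketched as follows. This is a minimal illustration, assuming a document-by-term weight matrix and population variance; the function name `remove_low_variance_terms` and the toy weights are hypothetical, and the slides do not state which variance estimator was used.

```python
from statistics import pvariance

def remove_low_variance_terms(matrix, terms, min_variance):
    """Keep only the columns (terms) whose variance across documents
    is at least min_variance. No maximum threshold is applied."""
    columns = list(zip(*matrix))  # transpose rows into per-term columns
    keep = [j for j, col in enumerate(columns) if pvariance(col) >= min_variance]
    kept_terms = [terms[j] for j in keep]
    kept_matrix = [[row[j] for j in keep] for row in matrix]
    return kept_matrix, kept_terms

# Hypothetical weights: rows are documents, columns are terms.
terms = ["cancer", "trial", "thyroid"]
matrix = [
    [0.030, 0.001, 0.000],
    [0.000, 0.001, 0.030],
    [0.000, 0.001, 0.000],
]
kept_matrix, kept_terms = remove_low_variance_terms(
    matrix, terms, min_variance=0.00008
)
```

In this toy input, "trial" has the same weight in every document, so its variance is zero and it is dropped. A term like that cannot help separate clusters.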


Results


Results


Results

Now we use the same data set with preprocessing: we removed stop words such as “and” and “or”.
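A minimal sketch of this preprocessing step, assuming lowercase tokenization on whitespace and a tiny illustrative stop-word list. The project's actual list and tokenizer are not given, so `STOP_WORDS` and `preprocess` are hypothetical names.

```python
import string

# Tiny illustrative stop-word list; the project's actual list is not given.
STOP_WORDS = {"and", "or", "the", "of", "a", "in"}

# Map every ASCII punctuation character to a space, so that "and/or"
# splits into "and" and "or" instead of merging into "andor".
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def preprocess(text):
    """Lowercase, replace punctuation with spaces, and drop stop words."""
    tokens = text.lower().translate(_PUNCT_TO_SPACE).split()
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess("Radiation therapy and/or gemcitabine, cisplatin.")
```

Replacing punctuation with spaces also splits hyphenated terms such as "interleukin-6"; a real pipeline might want to protect those.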


Results

This is the variance using the preprocessed data.

We set the threshold to 0.00005, 0.00006, 0.00007, 0.00008, 0.00009, 0.0001, 0.00011, 0.00012, 0.00013, 0.00014, 0.00015, 0.00016, 0.00017, 0.00018, 0.00019, 0.0002.

We set the threshold candidates to: 0.00003, 0.00006, 0.00008, 0.00009, 0.0001, 0.00011, 0.00012, 0.00013, 0.00014, 0.00015, 0.00016, 0.00017, 0.00018, 0.00019, 0.00020, 0.00021, 0.00022, 0.00023, 0.00024, 0.00025.


Results

K=5


Results


Results

Before dependent clustering


Results

After dependent clustering


Results

When we considered the medical terms, used the dictionary to preprocess the data, and used entities as the features, we get:


Results

The clustering results before dependent clustering:


Results

When we considered the medical terms, used the dictionary to preprocess the data, and used each term as a feature, we get:


Results

Before dependent clustering


Heterogeneous Naïve Bayes Classification

For every document in corpus B, find the probability that a relation exists, given each document in corpus A.


Corpus A

doc  t1    t2    t3   t4
A    rat   cat   cat  bat
B    rat   rat   bat
C    dog   dog   cat
D    bird  bat   bat  dog
Z    cat   bird  dog

Corpus B

doc  t1     t2       t3         t4      t5
1    trial  boy      boy        sick
2    trial  healthy  girl
3    trial  cancer   treatment  girl
4    trial  cancer   brain      cancer
5    trial  blind    boy        girl    girl
6    trial  brain    cancer     blind

Relational

Doc (A)  Doc (B)
A        2, 4
B        1, 6
C        1, 2
D        4


doc (A)  class    t1    t2   t3   t4
A        trial    rat   cat  cat  bat
A        healthy  rat   cat  cat  bat
A        girl     rat   cat  cat  bat
A        trial    rat   cat  cat  bat
A        cancer   rat   cat  cat  bat
A        brain    rat   cat  cat  bat
A        cancer   rat   cat  cat  bat
B        trial    rat   rat  bat
B        boy      rat   rat  bat
B        boy      rat   rat  bat
B        sick     rat   rat  bat
B        trial    rat   rat  bat
B        brain    rat   rat  bat
B        cancer   rat   rat  bat
B        blind    rat   rat  bat
C        trial    dog   dog  cat
C        boy      dog   dog  cat
C        boy      dog   dog  cat
C        sick     dog   dog  cat
C        trial    dog   dog  cat
C        healthy  dog   dog  cat
C        girl     dog   dog  cat
D        trial    bird  bat  bat  dog
D        cancer   bird  bat  bat  dog
D        brain    bird  bat  bat  dog
D        cancer   bird  bat  bat  dog
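The table above appears to expand each explicit link into one training row per term of the linked corpus B document, with the B term as the class and the corpus A document's terms as features. A minimal sketch of that expansion, using the toy corpora from the slides (the function name `build_training_rows` is hypothetical):

```python
# Toy corpora and explicit links, copied from the slides.
corpus_a = {
    "A": ["rat", "cat", "cat", "bat"],
    "B": ["rat", "rat", "bat"],
    "C": ["dog", "dog", "cat"],
    "D": ["bird", "bat", "bat", "dog"],
    "Z": ["cat", "bird", "dog"],
}
corpus_b = {
    1: ["trial", "boy", "boy", "sick"],
    2: ["trial", "healthy", "girl"],
    3: ["trial", "cancer", "treatment", "girl"],
    4: ["trial", "cancer", "brain", "cancer"],
    5: ["trial", "blind", "boy", "girl", "girl"],
    6: ["trial", "brain", "cancer", "blind"],
}
links = {"A": [2, 4], "B": [1, 6], "C": [1, 2], "D": [4]}

def build_training_rows(corpus_a, corpus_b, links):
    """One row per (linked A document, term occurrence in the linked B document):
    (A doc id, class = B term, features = A document terms)."""
    rows = []
    for a_id, b_ids in links.items():
        for b_id in b_ids:
            for b_term in corpus_b[b_id]:
                rows.append((a_id, b_term, corpus_a[a_id]))
    return rows

rows = build_training_rows(corpus_a, corpus_b, links)
```

Document Z has no explicit links, so it contributes no training rows; it only appears later as a document to be scored.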


docs   doc 1     doc 2     doc 3     doc 4     doc 5     doc 6
doc A  0.001645  0.001628  0.002001  0.002752  0.001866  0.002091
doc B  0.011329  0.004655  0.007571  0.013608  0.013202  0.016166
doc C  0.010044  0.008304  0.006061  0.003992  0.011153  0.003543
doc D  0.000146  0.000146  0.000435  0.000851  0.000146  0.000561
doc Z  0.000584  0.000584  0.001033  0.001655  0.000584  0.001206


Naïve Bayes Formulation

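$$
\begin{aligned}
P(\text{doc 1} \mid \text{doc A}) \propto{}
  & P(\text{trial}) \cdot P(\text{rat} \mid \text{trial}) \cdot P(\text{cat} \mid \text{trial}) \cdot P(\text{cat} \mid \text{trial}) \cdot P(\text{bat} \mid \text{trial}) \\
+{} & P(\text{boy}) \cdot P(\text{rat} \mid \text{boy}) \cdot P(\text{cat} \mid \text{boy}) \cdot P(\text{cat} \mid \text{boy}) \cdot P(\text{bat} \mid \text{boy}) \\
+{} & P(\text{boy}) \cdot P(\text{rat} \mid \text{boy}) \cdot P(\text{cat} \mid \text{boy}) \cdot P(\text{cat} \mid \text{boy}) \cdot P(\text{bat} \mid \text{boy}) \\
+{} & P(\text{sick}) \cdot P(\text{rat} \mid \text{sick}) \cdot P(\text{cat} \mid \text{sick}) \cdot P(\text{cat} \mid \text{sick}) \cdot P(\text{bat} \mid \text{sick}) \\
={} & 0.001645
\end{aligned}
$$

doc A: rat cat cat bat

doc 1: trial boy boy sick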
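The score for one document pair can be sketched in code. This is a minimal sketch under assumptions inferred from the numbers in the score table rather than stated on the slides: the prior P(c) is the fraction of training rows with class c, the likelihood P(t | c) is the fraction of corpus A tokens seen with class c that equal t, no smoothing is applied, and one product is summed per term occurrence in the corpus B document. The names `train` and `score` are hypothetical.

```python
from collections import Counter, defaultdict

corpus_a = {
    "A": ["rat", "cat", "cat", "bat"],
    "B": ["rat", "rat", "bat"],
    "C": ["dog", "dog", "cat"],
    "D": ["bird", "bat", "bat", "dog"],
    "Z": ["cat", "bird", "dog"],
}
corpus_b = {
    1: ["trial", "boy", "boy", "sick"],
    2: ["trial", "healthy", "girl"],
    3: ["trial", "cancer", "treatment", "girl"],
    4: ["trial", "cancer", "brain", "cancer"],
    5: ["trial", "blind", "boy", "girl", "girl"],
    6: ["trial", "brain", "cancer", "blind"],
}
links = {"A": [2, 4], "B": [1, 6], "C": [1, 2], "D": [4]}

def train(corpus_a, corpus_b, links):
    """Count training rows per class (B term) and A-term tokens per class."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    for a_id, b_ids in links.items():
        for b_id in b_ids:
            for b_term in corpus_b[b_id]:
                class_counts[b_term] += 1
                term_counts[b_term].update(corpus_a[a_id])
    return class_counts, term_counts

def score(a_terms, b_terms, class_counts, term_counts):
    """Sum over B-term occurrences c of P(c) * product over A terms t of P(t | c)."""
    n_rows = sum(class_counts.values())
    total = 0.0
    for c in b_terms:
        if class_counts[c] == 0:
            continue  # class never seen in training: contributes nothing
        n_tokens = sum(term_counts[c].values())
        p = class_counts[c] / n_rows
        for t in a_terms:
            p *= term_counts[c][t] / n_tokens  # zero if t never co-occurred with c
        total += p
    return total

class_counts, term_counts = train(corpus_a, corpus_b, links)
score_a1 = score(corpus_a["A"], corpus_b[1], class_counts, term_counts)
```

Under these assumptions `score_a1` comes out at about 0.001645, matching the doc A / doc 1 entry of the table. Any A term that never co-occurs with a class zeroes that class's whole product, which is why several entries in the doc D row are identical.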


Naïve Bayes Laplace Smoothing

• This better handles the terms that do not appear at all; however, we lose even more accuracy.

• This raises the question: Do we need to improve accuracy?
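To illustrate the trade-off, here is a minimal sketch of add-one (Laplace) smoothing applied to a single likelihood. It assumes token-frequency likelihoods and a corpus A vocabulary of five terms (rat, cat, bat, dog, bird); the function name `likelihood` is hypothetical, and the counts are those of class "boy" in the toy example.

```python
from collections import Counter

def likelihood(term, class_term_counts, vocabulary_size, alpha=0.0):
    """P(term | class) with add-alpha smoothing.

    alpha=0 gives the unsmoothed estimate; alpha=1 is Laplace smoothing.
    """
    total = sum(class_term_counts.values())
    return (class_term_counts[term] + alpha) / (total + alpha * vocabulary_size)

# Corpus A tokens that co-occur with class "boy" in the toy example:
# document B (rat rat bat) twice and document C (dog dog cat) twice.
boy_counts = Counter({"rat": 4, "bat": 2, "dog": 4, "cat": 2})
vocabulary_size = 5  # rat, cat, bat, dog, bird

unsmoothed_bird = likelihood("bird", boy_counts, vocabulary_size)
smoothed_bird = likelihood("bird", boy_counts, vocabulary_size, alpha=1.0)
smoothed_rat = likelihood("rat", boy_counts, vocabulary_size, alpha=1.0)
```

The unseen term "bird" moves from 0 to 1/17, so it no longer zeroes the whole product, while the observed term "rat" drops from 4/12 to 5/17. That shift away from the observed frequencies is one way to read the accuracy loss noted above.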


Naïve Bayes Accuracy

• We MAY not need to improve accuracy
• We are more interested in relative ratings

4.214256697426845E-61 9.545714918515275E-62 6.375720538007726E-69 …
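Because only the relative order of such small scores matters, one option is to compute and compare them in log space, which avoids floating-point underflow when many probabilities are multiplied. The sketch below is an assumption about how this could be done, not the project's code; `log_score` is a hypothetical helper and it requires all probabilities to be positive (for example after smoothing).

```python
import math

def log_score(class_terms):
    """log( sum_c P(c) * prod_t P(t | c) ) computed from log-probabilities.

    class_terms: list of (prior, [likelihoods]) pairs, all values > 0.
    Uses the log-sum-exp trick so tiny products do not underflow to zero.
    """
    logs = [
        math.log(prior) + sum(math.log(p) for p in likelihoods)
        for prior, likelihoods in class_terms
    ]
    largest = max(logs)
    return largest + math.log(sum(math.exp(x - largest) for x in logs))

# Nonzero contributions for doc A vs doc 1 in the toy example
# (priors and token-frequency likelihoods are assumed estimates).
doc_a_vs_1 = [
    (7 / 26, [6 / 24] * 4),                     # trial
    (4 / 26, [4 / 12, 2 / 12, 2 / 12, 2 / 12]), # boy
    (4 / 26, [4 / 12, 2 / 12, 2 / 12, 2 / 12]), # boy (second occurrence)
    (2 / 26, [2 / 6, 1 / 6, 1 / 6, 1 / 6]),     # sick
]
log_a1 = log_score(doc_a_vs_1)

# Ranking candidates by log score preserves the order of the raw scores.
log_scores = {"doc 1": log_a1, "tiny": log_score([(0.5, [1e-2] * 200)])}
ranked = sorted(log_scores, key=log_scores.get, reverse=True)
```

The "tiny" candidate multiplies 200 probabilities of 0.01, which would underflow to zero as a plain product but remains a finite, comparable number in log space.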


Naïve Bayes Future Improvement

• Improve accuracy
• Improve speed
• Determine criteria for predicting new links
• Find out if new links improve or harm dependent clustering


Contributions from each Member

Columns: SW/P Removal (1), Non-BT Removal (2), VTR (3), TFIDF Terms, TFIDF Entities, DC (4), DenC (5), NB (6), DA & DV (7), Web-site (8)

Jessica X X X X X

Lauren X X X X X X

Marcus X

Vince X X X X

1: Stop Word/Punctuation Removal
2: Non-biological term removal
3: Variance Term Removal
4: Dependent Clustering
5: Density Clustering
6: Naïve Bayes – New Algorithm
7: Data Analysis & Data Visualization

Contribution
