Clustering and Summarising Association Rules Mined from Phenotype, Genotype and Environmental Data Concerning Age-Related Hearing Impairment

Kati Iltanen (a), Sami Kiviharju (a), Lida Ao (a), Martti Juhola (a), Ilmari Pyykkö (b)

(a) School of Information Sciences, University of Tampere, Finland
(b) School of Medicine, University of Tampere, Finland



Kati Iltanen, Medinfo 2013 2

Introduction

Aim of the study: to examine the applicability of association rules for analysing the effects of genetic and environmental factors on age-related hearing impairment (ARHI)
  • To possibly generate new hypotheses for medical research

Association analysis
  • Data mining approach to discover items (variable-value pairs) frequently co-occurring in data
  • Association rules of the form “A → B” generated from frequent item sets
  • Capability to do a complete search efficiently


Introduction

Challenge
  • High-dimensional data result in a very large number of association rules.
  • Rules may be overlapping.
  • Postprocessing is needed.

Focus of the study: to develop an approach to cluster, summarise and represent association rules for easier exploration


ARHI data

Originate from a European multicentre study on ARHI

Collected in nine medical centres from seven European countries (e.g. Van Laer et al., 2008)

2428 cases: females and males aged 53 to 67

The cases represent the best and the worst hearing thirds of their population at high frequencies (2, 4 and 8 kHz)
  • 1241 cases with ARHI

Cases having pathologies (other than ARHI) possibly influencing hearing ability were excluded


ARHI data

764 variables

42 phenotypes and environmental factors

Phenotypes: e.g. gender, age, body mass index, blood pressure, diabetes, cardiovascular disease and renal failure

Environmental and lifestyle factors: e.g. use of ototoxic medication, exposure to chemicals, exposure to noise, alcohol use, and tobacco smoking

722 single nucleotide polymorphisms (SNPs) from 70 candidate genes


ARHI rules

Form of the rules: LHS → Zhighbest > 0.147

LHS: genotype, phenotype and environmental variables
  • From 1 to 3 items

RHS (Zhighbest > 0.147): “Has a hearing impairment”
  • Zhighbest: averaged gender- and age-independent Z-score of high frequencies (2, 4 and 8 kHz) for the better hearing ear
  • 0.147: a threshold value given by the expert physician

Rules were mined with Magnum Opus from RuleQuest Research.


ARHI rules

Interestingness measures used for association rules
  • Support: $s(A \rightarrow B) = P(A \cap B)$
  • Confidence: $c(A \rightarrow B) = P(B \mid A)$
  • Lift: $l(A \rightarrow B) = P(B \mid A) / P(B)$
  • Statistical significance: Fisher exact test
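The three measures and the significance test can be sketched from plain case counts. This is a minimal illustration and not the Magnum Opus implementation: the function names are hypothetical, the one-sided form of the Fisher exact test is an assumption, and the counts for the example rule (100 cases matching the LHS, 68 of them impaired) are invented. Only the totals (2428 cases, 1241 with ARHI) come from the slides.

```python
from math import comb


def rule_measures(n_total, n_lhs, n_rhs, n_both):
    """Support, confidence and lift of a rule A -> B from case counts."""
    support = n_both / n_total             # P(A and B)
    confidence = n_both / n_lhs            # P(B | A)
    lift = confidence / (n_rhs / n_total)  # P(B | A) / P(B)
    return support, confidence, lift


def fisher_exact_one_sided(n_total, n_lhs, n_rhs, n_both):
    """P(at least n_both co-occurrences) if A and B were independent.

    Assumption: a one-sided (greater) test based on the hypergeometric
    distribution; the slides do not say which variant was used.
    """
    upper = min(n_lhs, n_rhs)
    denominator = comb(n_total, n_lhs)
    numerator = sum(
        comb(n_rhs, k) * comb(n_total - n_rhs, n_lhs - k)
        for k in range(n_both, upper + 1)
    )
    return numerator / denominator


# Totals from the ARHI data: 2428 cases, 1241 with ARHI.
# The LHS counts (100 matched, 68 of them impaired) are invented.
s, c, l = rule_measures(2428, 100, 1241, 68)
p = fisher_exact_one_sided(2428, 100, 1241, 68)
print(f"support={s:.3f} confidence={c:.2f} lift={l:.2f} p={p:.2g}")
```

A lift above 1 means the impairment is more common among the cases matching the LHS than in the data as a whole.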


Clustering ARHI rules

Measure of similarity or closeness between two association rules
  • Proportion of cases matched by both rules among cases matched by either one or both rules (a variant of a measure presented by Gupta et al., 1999)

$s(R_i, R_j) = \dfrac{|R_i \cap R_j|}{|R_i \cup R_j|}$

where $R_i$ denotes the set of cases matched by rule $i$.

Example
  • Intersection of R22 and R26: 187 cases (both R22 and R26 hold for 187 cases)
  • Union of R22 and R26: 190 cases
  • Similarity between R22 and R26: 187/190 ≈ 0.98
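As a small sketch of this measure, each rule can be represented by the set of case identifiers it matches. The function name and the case identifiers below are hypothetical; the identifiers are chosen only so that the intersection and union sizes agree with the R22/R26 example (187 and 190 cases).

```python
def rule_similarity(cases_i, cases_j):
    """Share of cases matched by both rules among cases matched by either."""
    union = cases_i | cases_j
    if not union:
        return 0.0  # two rules that match no cases are treated as dissimilar
    return len(cases_i & cases_j) / len(union)


# Invented case ids: 187 shared cases, 190 cases in the union.
r22 = set(range(0, 189))  # matches cases 0..188
r26 = set(range(2, 190))  # matches cases 2..189

print(round(rule_similarity(r22, r26), 2))  # 0.98
```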


Clustering ARHI rules

Clustering method based on graph-theoretic techniques
  • Implemented using Matlab, Java and PostgreSQL

Rule graph
  • Rules: nodes
  • Similarities between rules: weights of edges between nodes
  • Similarities above the chosen threshold: connections between nodes

One connected component is a rule subset or cluster.
  • Clustering: searching for connected components

Figure: a connected component (a threshold of 0.3 used for the similarity measure).
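The clustering step can be sketched as a search for connected components in the thresholded rule graph. This is an illustrative Python version, not the authors' Matlab/Java/PostgreSQL implementation; the function name, the rule names and the case sets are invented, and only the 0.3 threshold is taken from the slide.

```python
from itertools import combinations


def cluster_rules(case_sets, threshold=0.3):
    """Group rules into connected components of the thresholded rule graph.

    case_sets maps a rule name to the set of case ids the rule matches.
    """
    names = list(case_sets)
    neighbours = {name: set() for name in names}
    for a, b in combinations(names, 2):
        union = case_sets[a] | case_sets[b]
        sim = len(case_sets[a] & case_sets[b]) / len(union) if union else 0.0
        if sim > threshold:  # connect only sufficiently similar rules
            neighbours[a].add(b)
            neighbours[b].add(a)

    clusters, seen = [], set()
    for start in names:
        if start in seen:
            continue
        component, stack = [], [start]
        seen.add(start)
        while stack:  # depth-first search over one component
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        clusters.append(sorted(component))
    return clusters


# Invented rules: R1-R2 and R2-R3 overlap enough, R4 is isolated.
rules = {
    "R1": {1, 2, 3, 4},
    "R2": {2, 3, 4, 5},
    "R3": {4, 5, 6},
    "R4": {10, 11},
}
print(cluster_rules(rules))  # [['R1', 'R2', 'R3'], ['R4']]
```

R1 and R3 end up in the same cluster although their direct similarity (1/6) is below the threshold, because both are connected to R2.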


Summarising rule subsets

Rules represented in HTML documents
  • Program implemented using Matlab

Rule subset information is given at different levels of detail

Overall summary listing for rule subsets
  • Number of rules, coverage, main item
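One line of the overall summary listing could be computed roughly as follows. The slides do not define the terms exactly, so two assumptions are made here: coverage is the share of all cases matched by at least one rule in the subset, and the main item is the item occurring in the most rules of the subset. The function name and example data are hypothetical.

```python
from collections import Counter


def summarise_subset(rules, n_cases):
    """Summary of one rule subset.

    rules is a list of (lhs_items, matched_case_ids) pairs.
    """
    covered = set()
    item_counts = Counter()
    for lhs_items, case_ids in rules:
        covered |= set(case_ids)       # cases matched by any rule so far
        item_counts.update(lhs_items)  # how often each item appears
    main_item = item_counts.most_common(1)[0][0] if item_counts else None
    return {
        "n_rules": len(rules),
        "coverage": len(covered) / n_cases,
        "main_item": main_item,
    }


# Invented subset of two rules over a data set of 10 cases.
subset = [
    (("noisy_workplace", "snp_a"), {1, 2, 3}),
    (("noisy_workplace",), {2, 3, 4, 5}),
]
print(summarise_subset(subset, n_cases=10))
```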


Summarising rule subsets

At the next level, rule subset information is enlarged with the information about the other items.


Representing rule subsets

Gene colouring

Marking items of special interest
  • Important SNPs from earlier studies

Ordering items in rules on the basis of item frequencies
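Ordering items by frequency can be sketched as below, so that an item shared by many rules in a subset always appears first. The tie-breaking by item name, the function name and the example items are assumptions made for illustration.

```python
from collections import Counter


def order_items(rules):
    """Sort the items of each rule by how many rules in the subset use them."""
    counts = Counter(item for rule in rules for item in rule)
    return [
        sorted(rule, key=lambda item: (-counts[item], item))  # ties: by name
        for rule in rules
    ]


# Invented rule subset; "solvent" occurs in every rule and moves to the front.
subset = [
    ["snp_b", "solvent"],
    ["solvent", "snp_a"],
    ["snp_a", "noise", "solvent"],
]
for rule in order_items(subset):
    print(rule)
```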


Representing rule subsets

Ordering rules in clusters on the basis of item frequencies


Representing rule subsets

Similarities between the rules in a similarity matrix

“Solvent exposure” rules

“Noisy workplace” rules

Highly overlapping rules


Summary statistics of ARHI rules

| | 1-item LHS | 2-item LHS | 3-item LHS |
|---|---|---|---|
| Size of search space | 2231 | 2.48535·10^6 | 1.84332·10^9 |
| Minimum support threshold | 50 cases | 50 cases | 1% |
| Minimum confidence threshold | 60% | 70% | 90% |
| Number of rules | 6 | 77 | 518 |
| Total coverage | 48.3% | 86.6% | 96.5% |
| Support | 3.4 - 13.4% | 2.1 - 7.8% | 1 - 2% |
| Confidence | 60 - 67.9% | 70 - 80.6% | 90 - 100% |
| Lift | 1.17 - 1.33 | 1.37 - 1.58 | 1.76 - 1.96 |

Common threshold values: lift 1, Fisher exact test: α = 0.01


Conclusions

Developed approach
  • simplified the rule exploration by grouping together
      • the rules concerning the same items
      • the rules concerning the same phenomenon
  • enabled the recognition of the overlapping rules
      • possibly suggesting more complex interactions

Association analysis
  • detected factors found significant in previous studies concerning this ARHI data
  • enabled more exhaustive analysis of more complex patterns
      • However, the problem of multiple testing has to be remembered.
  • gave new interesting information to the expert physician
      • especially rules concerning osteoporosis


References and acknowledgments

Acknowledgments

The authors are grateful to Baur M, Bille M, Bonaconsa A, Cremers CW, Demeester K, Dhooge I, Diaz-Lacava AN, Espeso A, Fransen E, Hannula S, Hendrickx JJ, Huygen PL, Huyghe J, Huyghe JR, Jensen M, Konings A, Kremer H, Kunst S, Lacava A, Lemkens N, Manninen M, Mazzoli M, Mäki-Torkko E, Orzan E, Parving A, Pawelczyk M, Pfister M, Rajkowska E, Sliwinska-Kowalska M, Sorri M, Steffens M, Stephens D, Topsakal V, Tropitzsch A, Van Camp G, Van de Heyning PH, Van Eyken E, Van Laer L, Verbruggen K, and Wienker TF, for the possibility to use the ARHI data.

References

Gupta et al., Distance based clustering of association rules. In: Intelligent Engineering Systems Through Artificial Neural Networks (Proceedings of ANNIE 1999), ASME Press, 1999, pp. 759-764.

Van Laer et al., The grainyhead like 2 gene (GRHL2), alias TFCP2L3, is associated with age-related hearing impairment. Hum Mol Genet 2008: 15: 159-69.