Worldwide Safety Strategy

A Hit-Miss Model for Duplicate Detection in the WHO Drug Safety Database

Andrew Bate, Senior Director, Analytics Team Lead, Epidemiology, Worldwide Safety Strategy

Person Validation and Entity Resolution Conference

Washington DC, May 23, 2011

PowerPoint presentation from the May 2011 Person Validation and Entity Resolution Conference. Presenter: Andrew Bate

Acknowledgements

• This research was wholly funded by the WHO Collaborating Centre for International Drug Monitoring

• I was at the time an employee of the WHO Centre.

• This presentation is my current opinion of the completed research

• Co-authors Niklas Norén and Roland Orre played an instrumental role in this research; Niklas Norén developed many of these slides

• For more information please contact:

[email protected]

Overview

• Background

– Post-marketing safety surveillance

– WHO Programme for International Drug Monitoring

– The problem of duplicate reports

• Method for detecting duplicate reports

• Results

• Concluding remarks

The WHO International Drug Monitoring Programme, 2004

The WHO International Drug Monitoring Programme

• Aim to discover suspected adverse drug reactions (ADRs) not identified in clinical trials, when drugs are on the market

• Collect reports from healthcare professionals and consumers internationally on suspected ADR incidents in clinical practice

• Run by WHO Collaborating Centre, Sweden

• Analysis based on a combination of quantitative methods for exploratory data analysis and expert clinical review

Spontaneous reporting limitations

• Clinical information on reports is often limited, and satisfactory secondary case evaluation is not always possible

• Not all ADRs that occur will be recognized as drug induced by a healthcare professional

• Even those that are suspected will not necessarily be reported

• Suspicion can mistakenly rest on the drug, resulting in coincidental spontaneous ADR case reports

• Control information is not collected as part of spontaneous reporting systems, the drug use is not known, and there is no direct information on disease incidence

Ref: Bate et al 2008 FCP

The WHO database of suspected drug side effects

• Strengths

– Database size (>6 million case reports, 200+ fields), now more than 1 million per year

– International coverage since 1967

– Reporting of all marketed drugs from 100+ countries

• Spontaneous reporting remains the data primarily used for post-marketing identification of suspected ADRs

Quantitative signal detection

• Detect potential signals for further investigation that are not readily recognisable on a single case report nor otherwise readily apparent at case entry

• Enhance rather than replace other methods of signal detection

– Clinical review remains critical

• Methods assume independence between reports

Duplicate case reports

• Unlinked case reports related to the same ADR incident: ‘duplicates’

• Duplication may be due to:

– Different reporting sources (health professionals, national authorities, different companies) having provided separate case reports related to the same incident

– Mistakes in linking follow-up case reports to the earlier database records

Problem extent

• Case report duplication is one of the most important data quality problems in post-marketing drug safety data, and therefore limits ADR identification capability

• There was no published research on methods for automated duplicate detection in this type of data

• No studies on how common duplicates really are (studies on vaccine ADR data suggest 5%, but for specific case series, rates around 20% have been reported)

Data extraction

• Hundreds of possible record fields for each case report (administrative information, incident information, patient information) but most case reports carry little information

• Anonymised data (but patient age and gender may be available)

• The following record fields are used: age, gender, country, date, drug substances, ADRs, outcome.

• Note: no free-text fields are used, as free text is rarely included in the anonymised reports entered into the WHO database

Duplicate case report characteristics

• Typically more similar than other record pairs

• Sometimes VERY different

• Great variety of discrepancies – no “safe” record fields

• Missing data can complicate things

Example: impact of missing data

• Consider the following two case reports:

Patient age | Patient gender | Country | Drug substances | ADR terms | Onset date | Outcome
?           | ?              | USA     | Bactrim         | Rash      | ?          | ?
?           | ?              | USA     | Bactrim         | Rash      | ?          | ?

• Likely duplicates?

• Identical case reports but too little information for the evidence to be considered strong!

The hit-miss model (Copas & Hilton, 1990)

• Compare the probability of a certain matching event under the assumption that the two records are related, to the same probability under the assumption that they are independent

• Under additional assumption of independence between record fields, the weights for the different record fields can be added to provide an overall match score

• Hit-miss model provides model for P(x,y) – the probability for different matching events between related records

$$W_{xy} = \log \frac{P(x, y)}{P(x)\,P(y)}$$
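
As a worked illustration of this weight, the following sketch derives closed-form weights for a single categorical record field. It rests on simplifying assumptions that are not on the slides: there are no blanks, value $j$ occurs with frequency $\beta_j$, and each of the two related records independently reports either the true value (a hit, probability $1-a$) or a random draw from $\beta$ (a miss, probability $a$). The symbols $\beta_j$, $a$, $c$ and $S$ are introduced here for illustration only.

$$P(x=j,\,y=k) = (1-c)\,\beta_j\,\delta_{jk} + c\,\beta_j\,\beta_k, \qquad c = 1-(1-a)^2 = a(2-a)$$

$$W_{jj} = \log\frac{1-c\,(1-\beta_j)}{\beta_j}, \qquad W_{jk} = \log c \quad (j \neq k)$$

$$S = \sum_i W^{(i)}_{x_i y_i}$$

Under these assumptions a match on a rare value (small $\beta_j$) gives a large positive weight, every mismatch gives the same negative weight $\log c$, and the per-field weights add up to the overall match score $S$.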

Hit-miss model weights

• Matches receive positive weights (greater rewards for matches on rare events)

• Mismatches receive negative weights (greater penalties for mismatches in record fields with few errors in training data)

• Record fields for which at least one of the records has missing data receive weight 0
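
The three rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a simplified hit-miss model without blanks, in which a match on value j gets weight log2((1 - c(1 - beta_j)) / beta_j) and a mismatch gets log2(c), with c = a(2 - a). The base-2 logarithm, the function names, the value frequencies and the miss probabilities are all hypothetical choices made for the example.

```python
import math

def hit_miss_weight(x, y, beta, a):
    """Weight for one categorical field; None marks missing data.

    beta: dict mapping each value to its (assumed) relative frequency.
    a:    assumed probability that a record carries a wrong ("miss") value.
    """
    if x is None or y is None:
        return 0.0                      # missing data contributes nothing
    c = a * (2 - a)                     # probability that at least one record is a miss
    if x == y:
        return math.log2((1 - c * (1 - beta[x])) / beta[x])   # always positive
    return math.log2(c)                 # always negative

def match_score(rec1, rec2, field_models):
    """Sum the per-field weights (assumes independence between fields)."""
    return sum(
        hit_miss_weight(rec1.get(field), rec2.get(field), beta, a)
        for field, (beta, a) in field_models.items()
    )

# Hypothetical frequencies and miss probabilities, for illustration only.
field_models = {
    "country": ({"USA": 0.5, "NOR": 0.01}, 0.05),
    "drug":    ({"Bactrim": 0.02}, 0.10),
    "adr":     ({"Rash": 0.05}, 0.10),
    "gender":  ({"F": 0.6, "M": 0.4}, 0.02),
}

sparse = {"country": "USA", "drug": "Bactrim", "adr": "Rash"}
full = dict(sparse, gender="F")

sparse_score = match_score(sparse, sparse, field_models)
full_score = match_score(full, full, field_models)
print(f"sparse pair: {sparse_score:.2f}, fuller pair: {full_score:.2f}")
```

Two identical but sparse reports score lower than two identical reports with more fields filled in, because every field with missing data adds 0 to the sum.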

Properties

• Accounts for both the level of agreement and the amount of information

• Imposes no strict criteria that a record pair must fulfil in order to be highlighted

• Allows the threshold for manual review to be adjusted based on the available resources

• Robust with respect to small amounts of training data

Fitting standard hit-miss models to the WHO database

• Model fitting is based on simple parameter estimation

• The probability for different values and for missing data in each record field can be estimated based on the data set as a whole

• The probability for a miss in a given record field needs to be estimated based on labelled duplicates (38 pairs available for the WHO database)
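
The estimation steps above could look roughly like the following Python sketch. It is an assumption-laden illustration rather than the published procedure: value and missing-data probabilities are taken as plain relative frequencies, and the miss probability is backed out from the mismatch rate among labelled duplicates using a simplified model without blanks, where P(mismatch) = c(1 - sum of beta_j squared) and c = a(2 - a). All function names are hypothetical.

```python
import math
from collections import Counter

def estimate_value_probs(values):
    """Relative frequency of each non-missing value in the whole data set."""
    observed = [v for v in values if v is not None]
    counts = Counter(observed)
    return {value: n / len(observed) for value, n in counts.items()}

def estimate_missing_prob(values):
    """Share of records in which the field is missing (None)."""
    return sum(v is None for v in values) / len(values)

def estimate_miss_prob(labelled_pairs, beta):
    """Estimate the miss probability a from labelled duplicate pairs.

    labelled_pairs: list of (x, y) field values from known duplicates.
    Solves mismatch_rate = c * (1 - sum(beta_j^2)) for c, then c = a(2 - a) for a.
    """
    complete = [(x, y) for x, y in labelled_pairs if x is not None and y is not None]
    if not complete:
        raise ValueError("no labelled pair has this field filled in on both records")
    chance_of_distinct = 1 - sum(p * p for p in beta.values())
    if chance_of_distinct == 0:
        raise ValueError("field has a single value, so misses cannot be observed")
    mismatch_rate = sum(x != y for x, y in complete) / len(complete)
    c = min(mismatch_rate / chance_of_distinct, 1.0)
    return 1 - math.sqrt(1 - c)

# Toy data, for illustration only.
gender_column = ["F", "F", "M", "M", None, "F", "M", None]
beta = estimate_value_probs(gender_column)
labelled = [("F", "F"), ("M", "M"), ("F", "M"), ("M", "M"), (None, "F")]
a = estimate_miss_prob(labelled, beta)
```

Only the last step needs labelled duplicates; the first two use the data set as a whole, which is why a small number of labelled pairs can be enough.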

Extending the hit-miss mixture model

• A generalisation of the standard hit-miss model to numerical record fields

• In addition to hits, misses and blanks, the hit-miss mixture model includes deviations

• Motivation: many types of errors in numerical record fields are likely to lead to small differences compared to the true value, rather than to random values
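
To make the idea of a deviation component concrete, here is a rough Python sketch for an integer-valued field such as age in years. It is a simplification and not the authors' parameterisation: it models only the difference d = x - y, treats related records as a mixture of an exact hit, a small deviation (a discretised normal distribution) and a miss (the difference behaves as for unrelated records), and assumes unrelated values are uniform over a fixed range. The parameters span, p_hit, p_dev and sigma are hypothetical.

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def deviation_prob(d, sigma):
    """Discretised normal: probability mass of the interval [d - 0.5, d + 0.5]."""
    return normal_cdf((d + 0.5) / sigma) - normal_cdf((d - 0.5) / sigma)

def unrelated_diff_prob(d, span):
    """P(x - y = d) for two independent values uniform on `span` consecutive integers."""
    return (span - abs(d)) / span ** 2

def mixture_weight(x, y, span, p_hit, p_dev, sigma):
    """Log-likelihood-ratio weight for a numerical field; None marks missing data."""
    if x is None or y is None:
        return 0.0
    d = x - y
    if abs(d) >= span:
        raise ValueError("difference outside the assumed value range")
    p_miss = 1.0 - p_hit - p_dev
    p_unrelated = unrelated_diff_prob(d, span)
    p_related = (
        (p_hit if d == 0 else 0.0)            # exact agreement
        + p_dev * deviation_prob(d, sigma)    # small deviation from the true value
        + p_miss * p_unrelated                # random value, as for unrelated records
    )
    return math.log2(p_related / p_unrelated)

# Hypothetical parameters for an age field.
AGE = dict(span=100, p_hit=0.85, p_dev=0.10, sigma=2.0)
exact = mixture_weight(50, 50, **AGE)
near = mixture_weight(51, 50, **AGE)
far = mixture_weight(51, 21, **AGE)
```

With these illustrative parameters an exact match scores highest, a one-year difference still earns a positive weight, and a thirty-year difference is penalised like an ordinary miss.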

Evaluation on Norwegian data

• The last Norwegian batch of reports from 2004 included 19 confirmed duplicates

• We used the hit-miss model to highlight suspected duplicates in this batch of 1559 case reports

• The match score threshold for likely duplicates was set to 37.5 (based on an assumed 5% duplicates in the data set and in order to achieve an estimated rate of false alarms of below 0.05)

Results

• 17 record pairs were highlighted as suspected duplicates

• 12 of these were confirmed duplicates, 5 were not

– 5 false positives

– 7 false negatives

– 63% recall

– 71% precision
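
The recall and precision figures follow directly from the counts on this slide (12 true positives, 5 false positives, 7 false negatives out of 19 confirmed duplicates). A short Python check, with a hypothetical helper name:

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall

# 17 highlighted pairs: 12 confirmed duplicates, 5 not; 19 confirmed duplicates in total.
precision, recall = precision_recall(true_pos=12, false_pos=5, false_neg=7)
print(f"precision = {precision:.0%}, recall = {recall:.0%}")
# prints "precision = 71%, recall = 63%"
```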

Top scoring record pair

• The highest match score in the study is for an alleged false positive

– Only near-matches on age and date, no matching ADR terms

– BUT 6 matching drug substances (not commonly co-prescribed)

– ... and ADR terms are semantically close

Age | Gender | Country | Drug substances        | ADR terms   | Onset date | Outcome | Score
51  | F      | NOR     | 6 matched, 1 unmatched | 3 unmatched | 2004-04-30 | ?       | +76.97
50  | F      | NOR     |                        |             | 2004-04-20 | ?       |

Follow-up

The Norwegian centre informed us that:

• The top scoring record pair does relate to a set of confirmed duplicates (submitted by different doctors in the same hospital)

• One of the other 'false positives' corresponds to a pair of suspected but as yet unconfirmed duplicates

Duplicates?

• Cluster of 3 reports highlighted in a specific country

• Onset date: 16th Dec 2003

• Age: 8, 18 and 29

• All female

• All had one drug listed and one AE listed – both drug and AE quite rarely reported

Duplicates?

• No – confirmed by reporting country

• But: were all reported by the same dentist

• Clearly these reports are not completely independent

• Analysis methods treat all reports as equally important and weight them equally

– Can use duplicate detection algorithm to down-weight very similar reports that are less likely to be ‘independent’

References

Copas J, Hilton F. Record linkage: statistical models for matching computer records. Journal of the Royal Statistical Society: Series A, 1990. 153:287-320.

Norén GN, Orre R, Bate A. A hit-miss model for duplicate detection in the WHO drug safety database. Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2005 (awarded Best Application Paper at the SIGKDD annual meeting, Chicago 2005).

Norén GN, Hopstadius J, Bate A, Star K, Edwards IR. Temporal Pattern Discovery in Electronic Patient Records. Data Mining and Knowledge Discovery, 2010. 20(3):361-387.

Conclusions

• The extended hit-miss model has several beneficial theoretical properties for this application

• Overall, the model's duplicate detection performance on real-world post-marketing drug safety data makes it very useful

• The hit-miss mixture model's capability to account for near-matches on age and date is important in real-world applications